The Internet Report

2024 Outage Trends Solidify; Plus OpenAI & Meta Outages

By Mike Hicks

Summary

With close to a year of data available, the topline outage trends for 2024 are coming into focus. Hear what the numbers are showing and also unpack recent OpenAI and Meta outages.


This is the Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.


Internet Outages & Trends

Over the years, we’ve discussed the growing importance (and ubiquity) of cloud for digitally enabled enterprises. This is reflected in outage numbers: as demand for cloud services increases and cloud footprints expand, there’s more potential for things to go awry. We’ve observed this over many years in the network and Internet service provider (ISP) space. While ISP-centric outages continue to be the most prevalent, cloud service provider (CSP) outages have been closing that gap in 2024.

There are two key factors driving cloud expansion, and both relate to the application workloads that run on cloud-based infrastructure. First, application architectures are increasingly distributed, utilizing a mix of owned and unowned infrastructure services and components that are orchestrated together to enable end-to-end transaction delivery. Second, resource-intensive capabilities like generative AI are being added to more applications, placing added strain on data center and cloud capacity.

Application owners continue to optimize their code, along with the mix of cloud and network resources they consume, in order to meet growing user demands. ISPs and CSPs are similarly making changes: adding to their infrastructure footprints, while constantly reconfiguring their environments to support that growing infrastructure.

This steady stream of software-based changes—to application codebases and the configurations of underlying infrastructure—is leading to an uptick in configuration errors that cause degradations or outages. For example, configuration errors appeared to lead to a recent OpenAI outage when the company attempted a change that overloaded key Kubernetes infrastructure.

We’ll unpack this and more in this week’s podcast and blog, our final episode for 2024.


Outage Trends Across 2024

In recent years, Internet service providers (ISPs) have accounted for the vast majority of outages because, frankly, there are so many of them powering so many diverse routes crisscrossing the globe.

After ISPs, the second most prevalent source of outages is cloud service providers (CSPs). As cloud footprints and service provider options expand, a greater number of issues are occurring, which translate into customer-facing impacts.

Eagle-eyed readers of this blog series will know that we’ve been tracking the ratio of ISP to CSP outages over many years, and that it was clear from early this year that 2024 looked different. Back in January, we noted a “slight shift in the ratio of ISP to CSP outages, changing from 89:11 in 2022 to 83:17 in 2023.” By mid-year, we could see this wasn’t an anomalous blip, with the “ratio rebalance accelerating significantly to 73:27.” This was, however, based on five months of data, not a full year.

With close to a full year of data now under our belt, that ISP:CSP outage ratio of 73:27 has held for the better part of 2024. While we’re awaiting final December numbers to make it official, it’s now all but confirmed that CSP outages, as a percentage of outages from both ISPs and CSPs, have risen from 17% to 27% in the space of a year. Conversely, the share of outages attributed to ISPs has decreased to 73%, down from 83% at the same point last year.

Figure 1. Ratio of ISP vs. CSP outages across 2024

It should be noted that ISPs aren’t experiencing fewer outages in total. Overall, outage numbers continue to climb, as more infrastructure is being added to the digital ecosystem. Instead, the shift that we’re seeing is that CSP outages are now becoming more frequent, indicating a changing landscape in service reliability.
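To make the distinction between share and count concrete, here is a minimal Python sketch. The yearly counts are hypothetical, chosen only so that the resulting shares match the 83:17 and 73:27 ratios quoted above; they are not ThousandEyes data. It shows that the ISP share of outages can fall even while the absolute number of ISP outages rises.

```python
def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Hypothetical yearly outage counts (assumption, not real data), chosen
# only so the shares match the 83:17 and 73:27 ratios quoted in the text.
isp_2023, csp_2023 = 830, 170
isp_2024, csp_2024 = 876, 324

csp_share_2023 = share(csp_2023, isp_2023 + csp_2023)
csp_share_2024 = share(csp_2024, isp_2024 + csp_2024)
isp_share_2024 = share(isp_2024, isp_2024 + csp_2024)

print(f"CSP share 2023: {csp_share_2023:.0f}%")
print(f"CSP share 2024: {csp_share_2024:.0f}%")
print(f"ISP share 2024: {isp_share_2024:.0f}%")
print(f"ISP outages rose in absolute terms: {isp_2024 > isp_2023}")
```

In this illustration, ISP outages grow from 830 to 876 while their share drops ten points, because CSP outages grow faster.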

The recent observations regarding ISP outages also suggest a significant shift in how these disruptions are managed. While there are still issues in transit infrastructure that have widespread impacts on applications and users, these are becoming the exception rather than the rule. Specifically with regional providers, a majority of ISP-related outages tend to have a very localized impact, on a specific subset of on-net users. More broadly, a number of ISP outage incidents coincided with scheduled maintenance windows—as the time of day in which they occur would indicate. Others are unscheduled but well-contained. A network failure in one area no longer directly leads to widespread repercussions in other regions, as was often the case in previous years.

In my opinion, this change reflects the ongoing transition of ISP networks as they incorporate more software-defined approaches. When ISPs now make configuration or other changes, they have a better knowledge of the likely flow-on effects of the change, allowing them to contain it. Additionally, ISPs continue to have the benefit of more alternate traffic paths than ever, and being software-defined allows them to switch traffic from degraded to healthy routes more quickly, again limiting any real negative impacts.

Our analysis shows that while network outages remain a reality and are likely to persist, they are increasingly overshadowed in frequency by functional failures within the systems themselves. This trend highlights the different dynamics at play in network reliability: not all disruptions stem from traditional outages. As technology continues to evolve, the focus may need to shift toward addressing these functional issues to enhance overall network resilience.

It’s also clear that the rising proportion of CSP outages will require careful observation and management. Applications, such as those leveraging generative AI, increasingly demand considerable cloud resources in order to run and operate effectively. This is driving significant cloud (and network) capacity expansion to cope with the extra demand. Looking ahead, this likely means even more CSP outages in 2025—with flow-on impacts to applications and to the users of those applications—something we’ll be keeping close tabs on.

OpenAI Outage

Speaking of outages caused by configuration errors, let's take a look at the OpenAI outage that happened on December 11. Starting at 11:15 PM (UTC), OpenAI users reported “difficulties logging in to platform.openai.com and ChatGPT,” and API calls were returning errors. The impact expanded to all of OpenAI’s services, including the company’s new Sora video generation model, which were rendered “unavailable.” Services started to be restored around 1:00 AM (UTC), although it took over four hours to mitigate the problems completely.


ThousandEyes observed issues with loading site content during the outage, which indicates problems with the backend application. These issues manifested primarily as HTTP 403 errors following the initial redirect. The HTTP 403 error occurs after authentication has taken place and signifies that, while the server understands the request, it refuses to fulfill it. Several factors could cause an HTTP 403 status code, but here it primarily points to issues stemming from the backend service. Additionally, ThousandEyes detected no network issues reaching the ChatGPT frontend web servers during the incident, further suggesting that the problems originated within the service's backend.

Figure 2. Partial page load and request for further information is met with HTTP 403 response
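The reasoning used here (healthy network path plus an HTTP error points to the application side) can be sketched as a small decision rule. This is an illustrative Python sketch, not how ThousandEyes classifies outages; the function name `classify_failure` and its labels are hypothetical, and it assumes the request comes from a known-good synthetic test, so an error status is attributed to the service rather than to the client.

```python
from typing import Optional


def classify_failure(network_ok: bool, http_status: Optional[int]) -> str:
    """Guess the failing layer from one test result (hypothetical helper).

    network_ok:  True if the path to the frontend servers showed no issues.
    http_status: HTTP status received, or None if the request timed out.
    """
    if not network_ok:
        return "network"
    if http_status is None:
        # Servers were reachable, but no HTTP response arrived in time.
        return "application (timeout)"
    if http_status >= 400:
        # Assumes a known-good request, so the error is service-side.
        return "application (HTTP error)"
    return "healthy"


# Pattern described above: no network issues, HTTP 403 after the redirect.
print(classify_failure(network_ok=True, http_status=403))
```

A real analysis would weigh many vantage points and signals together; a single result only hints at where to look first.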

An official post-mortem of the incident confirmed this, noting that the issue “stemmed from a new telemetry service deployment” that “unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems.”

OpenAI runs many Kubernetes clusters around the world. Kubernetes has two main parts: the control plane, which manages the cluster, and the data plane, which runs tasks like model inference.

The new telemetry service had a broad footprint: it led every node in each cluster to execute extensive Kubernetes API operations, and the resulting workload grew with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, causing the control plane to fail in most of the larger clusters.
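To see why larger clusters were hit hardest, consider a toy model in Python. It assumes (this is an assumption for illustration, not a detail confirmed here) that each node issues API operations whose individual cost grows in proportion to cluster size, so total load on the API servers grows roughly with the square of the node count. The capacity figure and cost constants are hypothetical.

```python
def api_load(nodes: int, ops_per_node: float = 1.0,
             cost_per_object: float = 0.001) -> float:
    """Toy model of total load on the Kubernetes API servers.

    Assumption: each operation's cost scales with cluster size, so
    total load is nodes * ops_per_node * (cost_per_object * nodes).
    """
    per_op_cost = cost_per_object * nodes
    return nodes * ops_per_node * per_op_cost


API_SERVER_CAPACITY = 500.0  # hypothetical capacity, arbitrary units


def is_overloaded(nodes: int) -> bool:
    """True if the modeled load exceeds the hypothetical capacity."""
    return api_load(nodes) > API_SERVER_CAPACITY


for n in (100, 1000, 5000):
    status = "overloaded" if is_overloaded(n) else "ok"
    print(f"{n:>5} nodes: load={api_load(n):.0f} -> {status}")
```

Under these assumptions a 100-node cluster stays well within capacity while clusters with thousands of nodes exceed it, which is consistent with the control plane failing mostly in the larger clusters.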

Failed API calls and login attempts resulted in user-facing impacts: All system resources were made inaccessible due to the control plane being down. The post-mortem adds that “remediation was very slow because of the locked out effect.” The organization is looking at “break-glass” mechanisms and decoupling Kubernetes components to aid recovery and service operation in the future.

Meta Outage

On December 11, Meta experienced an outage that affected users' ability to access several of its services, including Facebook and Instagram. The disruption began around 5:55 PM (UTC). The outage manifested as internal server errors and timeouts for some users across multiple regions while they were attempting to interact with the services. This indicates that the issues originated within Meta's backend services. This conclusion is further supported by the observation that, during the disruption, network connectivity to Meta's frontend web servers remained unaffected.

Figure 3. All regions affected; the issue manifested as HTTP 500 and server timeout errors


ThousandEyes observed a decrease in the number of web application components that loaded for users during the outage. Parts of a page or application failing to load is a common occurrence in application outages. Normally in those situations, the page or application appears to load faster because it’s loading fewer components, but the Meta outage was different. While some page components failed to load, the load time actually kept rising to the point where it became excessive and resulted in timeouts.

Figure 4. Drops in availability coincide with fewer page components loaded and increased page load time, but no network issues

ThousandEyes observed a ‘castellation’ effect as availability and impacted servers appeared to continually rise and fall. This castellation pattern suggests an intermittent impact on Meta’s users as the outage progressed.

Figure 5. Rise and fall of impacted servers, indicating intermittent effects
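One simple way to distinguish a castellation pattern from a single sustained dip is to count how often availability crosses a threshold. The Python sketch below is illustrative only: the function name `count_flaps`, the 90% threshold, and the sample series are all hypothetical, not ThousandEyes measurements or methodology.

```python
from typing import Sequence


def count_flaps(availability: Sequence[float], threshold: float = 90.0) -> int:
    """Count transitions between healthy and degraded samples.

    A sample is 'healthy' if it is at or above the threshold. Many
    transitions suggest an intermittent (castellated) pattern.
    """
    states = [a >= threshold for a in availability]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical availability samples (percent) for illustration.
castellated = [100, 60, 100, 55, 98, 40, 100]
sustained = [100, 50, 50, 50, 50, 100]

print("castellated flaps:", count_flaps(castellated))
print("sustained flaps:", count_flaps(sustained))
```

A sustained outage produces only two transitions (down, then back up), while a castellated one produces many, matching the repeated rise and fall seen in the figure.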

Although users could access the platform's main entry points, many encountered challenges when trying to reach essential features such as messages, posts, and updates. A significant portion of the user base found these functionalities to be consistently slow or completely unreachable, affecting both the desktop and mobile versions of the platform.


By the Numbers

In addition to our earlier discussion of the 2024 outage trends, let’s close with our usual deep dive into the global trends that ThousandEyes observed over the last two weeks (December 2 - 15) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.

  • The total number of global outages trended upward during this period. In the first week (December 2-8), ThousandEyes observed a significant 48% increase in outages, rising from 130 to 192. This upward trend continued into the following week (December 9-15), with the number of outages increasing from 192 to 205, a rise of about 7% compared to the previous week.

  • This upward trend was not fully reflected in the outages observed in the United States. During the first full week of December (December 2-8), outages increased significantly, rising by 84%. However, the following week saw a reversal, with outages decreasing from 123 to 109, which represents an 11% decrease compared to the previous week.

  • From December 2 to December 15, an average of 58% of all network outages occurred in the United States. This marks a slight increase from the 54% reported during the previous period, November 18 to December 1. This period continues a recent trend in which U.S.-centric outages have surpassed 50% of total outages for three consecutive reporting periods. It also continues a broader trend observed throughout much of 2024, where U.S.-centric outages typically accounted for at least 40% of all reported outages.
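The week-over-week figures in the list above follow from a simple percentage-change calculation. The short Python sketch below recomputes them from the outage counts quoted in the text; the helper name `pct_change` is just for illustration.

```python
def pct_change(previous: int, current: int) -> float:
    """Percentage change from previous to current (negative = decrease)."""
    return 100.0 * (current - previous) / previous


# Weekly outage counts quoted in the text above.
global_counts = [130, 192, 205]   # prior week, Dec 2-8, Dec 9-15
us_counts = [123, 109]            # Dec 2-8, Dec 9-15

global_week1 = pct_change(global_counts[0], global_counts[1])
global_week2 = pct_change(global_counts[1], global_counts[2])
us_week2 = pct_change(us_counts[0], us_counts[1])

print(f"Global, Dec 2-8:  {global_week1:+.1f}%")
print(f"Global, Dec 9-15: {global_week2:+.1f}%")
print(f"U.S., Dec 9-15:   {us_week2:+.1f}%")
```

Rounding to whole percentages is a presentation choice, so small differences from published figures can come down to rounding or to revised counts.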

Figure 6. Global and U.S. network outage trends over eight recent weeks
