This is the Internet Report, where we analyze outages and trends across the Internet over the previous two weeks, through the lens of ThousandEyes Internet and Cloud Intelligence. I'll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.
Internet Outages & Trends
The past few weeks feel somewhat like a representative sample of what 2024 looked like from an outage perspective, with connectivity issues and problematic updates at the root of all four significant incidents.
Both DigitalOcean and real-time payments provider Worldline experienced connectivity issues to data centers that made services unreachable. The latter incident, in particular, highlights the importance of ensuring the resilience of financial services. As we highlighted recently, the EU’s Digital Operational Resilience Act (DORA) is on the cusp of coming into effect. The continued occurrence of disruptions, even where the root cause is accidental, indicates that there is still work to do on infrastructure resilience.
Meanwhile, Microsoft and Reddit faced issues after making changes to their systems that had unforeseen user impacts and had to be rolled back. The Microsoft incident holds some particular lessons for ITOps teams. The issues were initially intermittent, a topic we have discussed at length before. Presenting as laggy or slow performance, intermittent issues can be hard to pinpoint or replicate. However, if a baseline can be established that shows what optimal performance looks like, then deviations like lagging performance can be detected more easily.
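To make the baselining idea concrete, here is a minimal sketch (not tied to any particular monitoring product) that keeps a rolling window of latency samples and flags measurements that stray well outside the established norm. The window size and threshold are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

WINDOW = 200          # number of recent samples that define "normal"
THRESHOLD_SIGMA = 3   # how far from the baseline counts as a deviation

baseline = deque(maxlen=WINDOW)

def check_latency(sample_ms: float) -> bool:
    """Return True if this sample deviates notably from the rolling baseline."""
    if len(baseline) < WINDOW:
        baseline.append(sample_ms)   # still learning what "normal" looks like
        return False
    mu, sigma = mean(baseline), stdev(baseline)
    is_anomaly = sigma > 0 and abs(sample_ms - mu) > THRESHOLD_SIGMA * sigma
    if not is_anomaly:
        baseline.append(sample_ms)   # only healthy samples update the baseline
    return is_anomaly
```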
It only takes one component—or even a single function—to fail or degrade, potentially halting the entire service delivery chain. When a disruption occurs, it's crucial to identify the source efficiently, and a vital step in this process is recognizing what isn't causing the problem. By combining various signals, ITOps teams can start to gain a clearer understanding of the outage’s cause, allowing them to quickly decide on the next steps and, more importantly, communicate effectively with users.
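As a rough illustration of combining signals to rule causes in or out, the sketch below maps a few hypothetical test results (DNS resolution, packet loss, HTTP status) to a likely fault domain. The signal names and thresholds are assumptions for illustration; real triage draws on far richer telemetry.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Signals:
    """Hypothetical snapshot of test results for one service target."""
    dns_resolved: bool          # did name resolution succeed?
    packet_loss_pct: float      # forwarding loss toward the service edge
    http_status: Optional[int]  # last HTTP status code, or None if nothing came back

def likely_fault_domain(s: Signals) -> str:
    """Narrow down where the problem probably is by ruling domains out."""
    if not s.dns_resolved:
        return "DNS"
    if s.packet_loss_pct >= 90:
        return "network path"                   # traffic isn't reaching the front end
    if s.http_status is None:
        return "timeout: network or overloaded backend"
    if 500 <= s.http_status < 600:
        return "application or backend"         # front end reachable, backend failing
    return "no obvious fault in these signals"
```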
This visibility is also increasingly critical to determining what intervention to make, and whether that intervention is manual or (as is increasingly the case) automated. During the recent Microsoft 365 issues, the initial intervention appeared to make the problem worse before it got better. ITOps teams need data and insight at their disposal to understand how an intervention might land, so they can plan and make decisions accordingly.
Read on to learn about all the outages and degradation trends, or use the links below to jump to the sections that most interest you:

- DigitalOcean's Network Issues
- Reddit Server Errors
- Issues for Microsoft 365
- Worldline's Payment "Perturbations"
- By the Numbers
DigitalOcean's Network Issues
On November 27, DigitalOcean customers with instances in its SFO3 region experienced connectivity issues.
At around 10:25 AM (UTC), ThousandEyes observed issues as packet loss increased within DigitalOcean nodes located in Santa Clara, CA, impacting access to DigitalOcean resources in that area.
Explore the DigitalOcean network issues further in the ThousandEyes platform (no login required).
For customers, the disruption manifested as connectivity issues with applications and services in the affected DigitalOcean region, making the services unreachable. However, unlike many of the outage scenarios we discuss in this blog series, where a page partially loads because frontend systems are still reachable, the DigitalOcean disruption rendered its services entirely unavailable. Pages could not be loaded, and 100% packet loss was observed, indicating that the most likely cause was on the network side.
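As a back-of-the-envelope way to distinguish a hard, network-level outage from partial degradation, the sketch below repeatedly attempts TCP connections to a placeholder host and reports the failure rate. A rate near 100% is consistent with the kind of complete unreachability seen here, though it is only a rough stand-in for the path-level loss measurements described above.

```python
import socket
import time

def connect_failure_rate(host: str, port: int = 443,
                         attempts: int = 20, timeout: float = 2.0) -> float:
    """Estimate reachability as the share of TCP connection attempts that fail."""
    failures = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            failures += 1
        time.sleep(0.5)  # pace the probes
    return failures / attempts

# Example with a placeholder host: a result near 1.0 points to a hard,
# network-level outage; values in between suggest intermittent degradation.
# print(connect_failure_rate("example.com"))
```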
DigitalOcean later confirmed that impacted customers would have experienced “connectivity issues, latency, and timeout errors,” which aligns with ThousandEyes' observations.
Reddit Server Errors
On November 20 and 21, Reddit experienced a pair of significant outages linked to issues within its backend systems and services that together totaled approximately three hours.
The first incident was reportedly caused by a bug in a software update that blocked a subset of users from accessing the platform; the second was attributed to a separate update that “caused stability problems.”
Users reported encountering a range of intermittent timeouts while trying to access the platform. It was possible to load the original post on a page, as this was likely cached by Reddit's CDN provider; but because the rest of the page is dynamic and constantly updated with new comments, pages did not load correctly. Instead, users were presented with a series of 5xx errors, indicating problems on the server side that prevented requests from being fulfilled.
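One way to see why a cached post can load while the dynamic parts of a page fail is to inspect the caching headers on the response. The sketch below uses common headers such as Age and X-Cache, which vary by CDN; it illustrates the general behavior rather than how Reddit's stack is instrumented.

```python
import urllib.error
import urllib.request

def describe_response(url: str) -> str:
    """Rough check of whether a response came from a CDN cache or the origin."""
    req = urllib.request.Request(url, headers={"User-Agent": "probe/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            age = resp.headers.get("Age")              # seconds spent in a shared cache
            x_cache = resp.headers.get("X-Cache", "")  # e.g. "HIT" or "MISS" on many CDNs
            cached = bool(age and int(age) > 0) or "HIT" in x_cache.upper()
            source = "cache" if cached else "origin"
            return f"HTTP {resp.status}, likely served from {source}"
    except urllib.error.HTTPError as err:
        # A 5xx here means the request reached the edge but a server-side failure followed
        return f"HTTP {err.code}, server-side error"
```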
The impact of these issues appeared to be global, affecting users across various regions simultaneously. In response, Reddit conducted an investigation and identified what they described as a bug introduced by the recent update. Unfortunately, after a span of approximately 17 hours, the service interruption resurfaced, displaying similar characteristics to the initial outage.
It remains unclear at this juncture whether the re-emergence of the outage is a consequence of the same bug that was introduced with the original update, or if it is related to the subsequent “fix” that was applied. However, the similarities in symptoms during both incidents suggest a potentially unresolved underlying issue that warrants further investigation.
Issues for Microsoft 365
On November 25, a software change made by Microsoft to its Microsoft 365 service led to issues for users trying to access Exchange Online and impacted some “functionality within the Microsoft Teams calendar” service.
The outage was first observed around 2:00 AM (UTC), and it initially appeared to be intermittently impacting a small number of regions. During the incident, users experienced timeout errors, resolution failures, and, in some cases, HTTP 503 status codes, indicating that the service was unavailable.
Notably, the path to the edge servers did not show any problematic network conditions that could have been causing these timeouts, such as increased packet loss at the edge. These “healthy” network conditions, contrasted with the errors experienced by users, suggest that the most likely source of the disruption was a backend service issue of some kind.
In other words, while the service front end was reachable, subsequent requests for components, objects, or other services were not consistently available. The intermittent nature of the problem meant that it wouldn’t have always been noticeable to end users, often presenting as slow or lagging responses.
ThousandEyes observed that this issue appeared to clear around 3:05 AM (UTC) initially but reappeared around 7:00 AM (UTC), manifesting again as timeouts and service unavailability errors. The second incident seemed to affect more regions than the first, with the number of impacted servers rising and falling in a cyclical pattern, which may indicate a backend request load issue.
As the second outage progressed, ThousandEyes observed that, in addition to the timeout and service unavailable errors, packet loss increased at the edge of the Microsoft network. The observed loss rate was higher than in the previous disruption, though it again occurred at the egress of the Microsoft network. Additionally, ThousandEyes didn't see a consistent 100% loss throughout the period, which may suggest increased congestion when connecting to the services and an inability to reach or connect with backend services.
Around 9:00 AM (UTC), Microsoft acknowledged the widespread issues. It initiated a fix at 2:00 PM (UTC), which involved performing "manual restarts on a subset of machines that [were] in an unhealthy state." Soon after, the number of reported errors notably increased, with more servers being affected. An X post from the Microsoft 365 account indicated that “targeted restarts are progressing slower than anticipated for the majority of affected users.”
Microsoft later shared more about the root cause, reporting that the problems originated from "a change that caused an influx of retry requests routed through servers, impacting service availability."
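Microsoft hasn't published the specifics of the retry behavior involved, but the general failure mode, in which retries amplify load on an already struggling backend, is well known. A common client-side mitigation is capped exponential backoff with jitter, sketched below with illustrative parameter values.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call with capped exponential backoff plus jitter.

    Spreading retries out and capping them helps avoid the self-reinforcing
    retry traffic that can overwhelm an already degraded backend.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # "full jitter" backoff
```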
To address the issues, Microsoft implemented optimizations to enhance its infrastructure's processing capabilities. After these changes, service appeared to be restored gradually. This aligns with ThousandEyes’ observations of a series of timeout-related errors where services failed to respond, along with HTTP 503 (service unavailable) and 404 errors. These errors indicated that, though communication with the frontend server was established, the server could not locate or reach the requested resource in the backend.
Worldline's Payment "Perturbations"
European instant payments provider Worldline observed “perturbations in its payments ecosystem” on November 28, with the impacts largely confined to Italy, although some flow-on impact was felt in other markets.
The “perturbations” were later explained to be connection issues after civil works severed a third-party-operated link to Worldline’s data centers in Italy. “Local authorities' gas pipe installation work severely damaged cables and the network of our supplier,” according to a status notification from impacted payment terminal maker Ingenico.
During the incident, the fintech company said it was “relentlessly working on identifying potential workarounds to reactivate services, waiting for the physical infrastructure to be restored.” As Ingenico's updates noted, the incident had a broad impact across the banking and payments spectrum.
The outage caught the attention of the Bank of Italy and business groups, as the timing, although not Worldline’s fault, coincided with one of the year's most significant retail sales events.
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (November 18 - December 1):
- The total number of global outages showed a downward trend during this period. In the first week, ThousandEyes observed a 2% decrease in the number of outages, dropping from 250 to 245. This downward trend continued into the following week (November 25 to December 1), when there was a significant reduction in outages; the number decreased from 245 to 130, marking a 47% decline compared to the previous week.
- The downward trend in outage numbers was also observed in the United States. During the first week (November 18-24), outages decreased by 16%. The following week saw an even more significant drop, with outages halving from 135 to 67, representing a 50% decrease compared to the previous week.
- From November 18 to December 1, an average of 54% of all network outages occurred in the United States. Although this represents a decrease from 64% during the previous period (November 4-17), it is still the second-largest share recorded this calendar year. These numbers technically continue the typical pattern seen in 2024, in which U.S.-centric outages usually account for at least 40% of all reported outages, but the two most recent figures are higher than any others seen this year.
- In November, there were 840 outages observed globally, marking a 6% increase from the 792 outages recorded in October. In the United States, outages rose from 333 in October to 501 in November. This trend is unusual compared to previous years, as total outages have typically decreased from October to November both globally and in the U.S.