This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
Some outages really make themselves known. User complaints come flooding in, the service is completely down, and the story is all over the headlines. IT operations teams do their best to avoid these major outages.
But other outages—stealth outages, we’ll call them—are more subtle. They’re harder for ITOps teams to detect. Issues might only pop up intermittently, many users may not even notice the problem, and the disruption won’t necessarily make front-page news. However, these stealth outages should not be ignored. They can still have a negative impact on user experience and on the business. And because they more easily slip under the radar, they may last longer and have more time to cause issues.
In recent weeks, Slack, Microsoft 365, and X experienced service disruptions that may fall in this “stealth outages” category.
Read on to learn more about what happened at Slack, Microsoft 365, and X, and explore other outage news and trends—or use the links below to jump to the sections that most interest you:
Slack Outage
On May 12, Slack experienced a global disruption that left users unable to send messages, load channels or threads, use integrated applications, or even launch the app. ThousandEyes first observed the issue around 10:20 PM (UTC) and Slack acknowledged it shortly after. The outage lasted over an hour, with Slack declaring the issue completely resolved at 11:58 PM (UTC).

Though it had global impact, the outage largely flew under the radar. The incident’s subtlety stemmed partly from the time of day it happened: the afternoon on the United States’ West Coast, when the business day was winding down for that region and just beginning for many in Asia Pacific. Additionally, the outage’s symptoms didn’t include any glaring indicators such as blank screens or obvious error messages. Unless users were actively sending or expecting messages, they likely remained largely unaware of the disruption.
Explore the Slack outage further in the ThousandEyes platform (no login required).
During the outage, ThousandEyes observed multiple cycles of page load times and availability falling and then recovering, indicating inconsistent performance that could have been related to load (e.g., insufficient resource availability) or to other connectivity problems, such as inconsistent paths to backend services. The outage’s somewhat intermittent nature suggested it might have stemmed from backend routing or forwarding problems, or potentially a backend load issue.
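To make that intermittent pattern concrete, here’s a minimal sketch (in Python) of the kind of repeated probing that surfaces these rise-and-fall cycles. The URL and probe interval are illustrative placeholders, not Slack’s actual endpoints or ThousandEyes’ methodology.

```python
import time
import requests

# Hypothetical target and interval; substitute your own service endpoint.
TARGET_URL = "https://status-probe.example.com/health"
INTERVAL_SECONDS = 30

def probe(url: str) -> dict:
    """Issue one probe and record availability and load time."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=10)
        elapsed = time.monotonic() - start
        return {"ok": resp.status_code < 500, "status": resp.status_code, "seconds": elapsed}
    except requests.RequestException as exc:
        return {"ok": False, "status": None, "seconds": time.monotonic() - start, "error": str(exc)}

if __name__ == "__main__":
    while True:
        result = probe(TARGET_URL)
        # A run of alternating ok/not-ok results over time is the signature of
        # the intermittent, cyclical degradation described above.
        print(time.strftime("%H:%M:%S"), result)
        time.sleep(INTERVAL_SECONDS)
```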

Additionally, ThousandEyes did not observe any corresponding network issues during the disruption, such as increased latency or significant network loss. Network latency remained constant throughout the outage, indicating that the network was not responsible for the problems users experienced.
ThousandEyes observed HTTP 500 internal server errors across multiple regions. The presence of a 5xx status code further supported that network connectivity or reachability issues weren’t the cause. While there was connectivity to the “front door” of the service, errors occurred when trying to interact with the backend services.

The presence of an HTTP 500 internal server error not only confirmed that the frontend service was up and available but also indicated that the frontend had requested information from the backend and the server was unable to fulfill the request. The actual error code can provide clues about the underlying cause. Although it is generally considered a generic server error, the HTTP 500 status indicates that the issue arose from an unexpected condition. Common causes of this error include server misconfigurations, issues within the website's code, or problems with database connections.
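As a rough illustration of that line of reasoning, the sketch below (Python, with a placeholder URL) separates “the front door answered but the backend failed” (a 5xx response) from “the service was never reached at all” (a connection error or timeout).

```python
import requests

def classify_failure(url: str) -> str:
    """Distinguish backend errors from reachability problems for one request."""
    try:
        resp = requests.get(url, timeout=10)
    except (requests.ConnectionError, requests.Timeout):
        # No HTTP response at all: points toward network or reachability issues.
        return "network/reachability problem"
    if 500 <= resp.status_code < 600:
        # The frontend answered, so connectivity is fine; the server could not
        # fulfill the request, e.g., misconfiguration, code, or database issues.
        return f"backend error (HTTP {resp.status_code})"
    return f"healthy (HTTP {resp.status_code})"

print(classify_failure("https://app.example.com/api/ping"))  # placeholder URL
```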
Slack later confirmed that the problem was indeed related to database connection issues. On its status page, Slack said that the incident stemmed from issues with the communication link between its web application and the database routing layer.
While this specific Slack disruption may have seemed relatively minor for many users, it is essential for IT operations teams to promptly identify and resolve even intermittent or lower-profile issues. The significance of addressing such hiccups cannot be overstated, as any disruption that affects user experience—no matter how briefly or for a limited group—has the potential to damage brand reputation and customer loyalty.
For instance, consider an intermittent outage that occurs at a critical moment, such as when a user attempts to send an urgent, time-sensitive message at work. This type of disruption can lead to frustration and hinder productivity, ultimately impacting the user’s perception of the brand. In other scenarios, like those involving payment platforms, the stakes can be even higher. If a store believes that a customer’s payment has been successfully processed when, in reality, it has not, the repercussions can be severe, including financial losses for the business.
To identify and avoid even the most “stealthy” outages, ITOps teams must be diligent about monitoring their entire service delivery chain and analyzing any potential symptoms in context. Only then can organizations consistently spot issues and fully grasp a problem’s severity and impact. Additionally, teams should make sure they have an understanding of what “normal” looks like for all operational signals. Without this comprehensive benchmarking in place, minor outages or service disruptions can easily go unnoticed, leading to larger issues down the line.
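One lightweight way to capture what “normal” looks like is to keep a rolling history of each operational signal and compare new samples against it. The sketch below is a generic Python illustration of that idea, not a ThousandEyes feature; the window size and threshold are arbitrary examples.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling baseline for a single operational signal (e.g., page load time)."""

    def __init__(self, window: int = 288, sigmas: float = 3.0):
        self.samples = deque(maxlen=window)  # e.g., 288 five-minute samples = one day
        self.sigmas = sigmas

    def is_anomalous(self, value: float) -> bool:
        """Return True when a sample falls well outside the recent norm."""
        if len(self.samples) >= 30:  # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = sigma > 0 and abs(value - mu) > self.sigmas * sigma
        else:
            anomalous = False
        self.samples.append(value)
        return anomalous

baseline = Baseline()
for load_time in [1.1, 1.2, 1.0, 1.3, 1.1] * 10 + [4.8]:  # synthetic values
    if baseline.is_anomalous(load_time):
        print(f"Load time {load_time}s is outside the normal range")
```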
Microsoft 365 Outage
On May 6, a disruption affected multiple Microsoft 365 services, including Teams and Outlook. The outage appeared to mainly impact users in North America. Like the Slack outage, the Microsoft incident seemed somewhat intermittent in nature, with the number of servers exhibiting outage conditions appearing to fluctuate during the incident.

ThousandEyes observed network loss that appeared as forwarding loss on the penultimate hop before reaching the Microsoft network. This forwarding loss was mainly seen in connections originating from North America; tests conducted from other regions did not show the same issues. This indicates a problem within the service forwarding system at the edge of the Microsoft network, specific to that region, rather than a global failure affecting the backend services or applications.
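Loss concentrated at a single hop like this typically shows up in a per-hop path trace. As a rough approximation of that view, the sketch below shells out to the mtr utility (assuming it is installed) and flags hops reporting loss; the target hostname is a placeholder and the output parsing is best-effort.

```python
import subprocess

def per_hop_loss(target: str, probes: int = 20) -> list[tuple[str, float]]:
    """Run mtr in report mode and return (hop, loss%) pairs.

    Assumes the mtr utility is available; parsing may need adjusting
    for your mtr version's report format.
    """
    out = subprocess.run(
        ["mtr", "--report", "--report-wide", "-c", str(probes), target],
        capture_output=True, text=True, check=True,
    ).stdout
    hops = []
    for line in out.splitlines():
        if "|--" in line:  # hop lines look like "  3.|-- host  0.0% ..."
            parts = line.split()
            hops.append((parts[1], float(parts[2].rstrip("%"))))
    return hops

hops = per_hop_loss("outlook.office365.com")  # placeholder target
for i, (hop, loss) in enumerate(hops):
    marker = " <-- penultimate hop" if i == len(hops) - 2 else ""
    if loss > 0:
        print(f"hop {i + 1} ({hop}): {loss}% loss{marker}")
```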
The connections are managed by an Azure load balancing service, which Microsoft later confirmed to be Azure Front Door (AFD). This cloud-based service employs Layer 7 load balancing to distribute traffic across various regions and endpoints.

Services traversing the impacted Azure Front Door (AFD) infrastructure were affected. In a service alert (MO1068615) in the Microsoft 365 admin center, Microsoft confirmed that the impacted services included, but were not limited to, Microsoft Teams, and the most likely cause lay within the routing configuration for its AFD cloud content delivery network.
"We're reviewing Azure Front Door (AFD) routing configurations and networking telemetry to isolate the source of the issue," Microsoft stated.
At 3:50 PM (UTC) / 11:50 AM (EDT), Microsoft confirmed that the issue had been resolved.
The outage primarily affected one region and was caused by a load balancing issue, which likely resulted in intermittent impacts. This means that even within the affected area, some requests may have been successful. Intermittent outages can often be likened to “stealth outages” because they can go unnoticed for extended periods, making it harder for both users and service providers to identify the underlying problem.
Several factors contribute to the difficulty in detecting these outages. For instance, if the incident occurs during off-peak hours—times when user traffic is low—many users may not realize there’s an issue until the traffic picks up later in the day. Additionally, subtle symptoms of degradation in service, such as a noticeable lag in performance or slow data population, may not immediately raise alarm bells for users who are still able to access the service, albeit with delays.
Moreover, individual function failures might occur in different parts of the service or application, which may not be used by all users. This means that while some functionalities are impaired for certain users, others might remain unaffected, creating a false sense of normalcy. This complexity can mask the severity of the outage and prolong resolution times, as users might not report issues unless they are severe enough to impact their experience significantly.

The better ITOps teams understand their entire service delivery chain and what their usual performance looks like, the more equipped they are to identify when something’s subtly amiss. Having automatic alerts in place to notify them when one area dips below typical levels can also be incredibly helpful.
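As a simple illustration of that kind of automatic alerting, here’s a generic sketch; it isn’t any particular product’s alerting feature, and the notify function stands in for whatever paging or chat integration a team already uses.

```python
def notify(message: str) -> None:
    """Stand-in for a real paging or chat integration."""
    print(f"ALERT: {message}")

def check_availability(current: float, typical: float, tolerance: float = 0.05) -> None:
    """Fire an alert when availability dips more than `tolerance` below its typical level."""
    if current < typical * (1 - tolerance):
        notify(f"Availability {current:.1%} is below the typical {typical:.1%}")

# Example: a region that normally sits at 99.9% availability drops to 93%.
check_availability(current=0.93, typical=0.999)
```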
X Service Disruption
Starting on the afternoon of May 8 (EDT), the social media platform X (formerly Twitter) experienced a service disruption that reportedly lasted until approximately 11 AM (EDT) / 3 PM (UTC) the next day. The problems manifested as a functional issue that disrupted notifications for a broad range of users across the globe. Some users reported receiving no new notifications even though they had alerts turned on for accounts that had posted multiple times during the incident.
During the disruption, ThousandEyes observed that connectivity to the service itself appeared to be intact, indicating that the issues likely weren’t network-related. However, ThousandEyes saw evidence indicative of backend service issues.
Several instances of the error message "maximum redirect count exceeded" were observed during the disruption. This error typically indicates that a request has experienced too many redirects while trying to access a resource. As a result, the client may end up in an infinite loop, continuously being redirected without ever reaching the intended destination.
Redirect loops can occur for various reasons. One common cause is the improper configuration of redirect rules. Changes or updates made to backend systems, such as server configuration adjustments, modifications to URL structures, or updates to frameworks, can lead to these loops. For example, the server might redirect a page to itself or to another page that eventually redirects back to the original page. This cyclical behavior prevents users from accessing the desired content.
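To illustrate how this failure mode surfaces to a client, the sketch below (Python, with a placeholder URL) follows redirects manually with a cap, roughly mirroring what a browser or HTTP library does before giving up and reporting that the maximum redirect count was exceeded.

```python
from urllib.parse import urljoin
import requests

def follow_redirects(url: str, max_redirects: int = 10) -> str:
    """Follow redirects by hand, stopping on a loop or when the cap is exceeded."""
    seen = set()
    for _ in range(max_redirects):
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return f"landed on {url} with HTTP {resp.status_code}"
        next_url = urljoin(url, resp.headers["Location"])
        if next_url in seen:
            return f"redirect loop detected at {next_url}"
        seen.add(next_url)
        url = next_url
    return "maximum redirect count exceeded"

print(follow_redirects("https://social.example.com/notifications"))  # placeholder URL
```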

By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed over recent weeks (May 5-18) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.
- The upward trend in global outages that we’ve observed since mid-April continued over the past two weeks, with outages increasing noticeably from May 5 to May 18. In the first week of this period (May 5-11), ThousandEyes recorded an 11% rise in outages, which increased from 444 to 495. This trend persisted into the following week (May 12-18), when outages rose another 8%, from 495 to 536.
- The United States followed a similar pattern. During the first week (May 5-11), outages increased slightly from 95 to 97, a 2% rise. However, in the week of May 12-18, they surged from 97 to 149, representing a significant 54% increase.
- From May 5 to May 18, the United States accounted for an average of 24% of all network outages, which is an increase from the 21% observed in the previous period (April 21 - May 4). This 24% marks the third consecutive period in which U.S.-based outages constituted less than 40% of all recorded outages.
