
The Internet Report

The Top Internet Outages of 2025: Analyses and Takeaways

By Barry Collins | 12 min read

Summary

Review some of 2025’s most notable Internet outages and incidents, with key learnings for ITOps teams to take into 2026.


This is The Internet Report, where we analyze outages and trends across the Internet through the lens of Cisco ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.

Internet Outages & Trends 

Minimizing the impact of outages and disruptions is the number one priority for IT operations teams. Though not every outage can be avoided, ensuring you recover as quickly as possible is the key to business continuity.  

With that in mind, we’re going to look back at some of the biggest incidents from 2025, which provide plenty of learnings for ITOps teams to take into 2026. We’ll recap our analyses of each outage in turn, revealing what caused the issue and respective takeaways that can lessen the impact should a similar incident strike again. 

Read on to learn more, or jump ahead to the sections that most interest you: 

Asana Outages (February 5 & 6)

Asana suffered two configuration-related outages on successive days in February. The first incident, on February 5, was caused by a configuration change that overloaded server logs, causing servers to restart. Asana rolled back the change, but a day later a second outage with similar characteristics occurred. This second incident was quickly contained, leading to approximately 20 minutes of downtime. 

This pair of outages highlights the complexity of modern systems and how difficult it is to test for every possible interaction scenario. However, after the second incident, Asana implemented new procedures to guard against cascading failures and transitioned to staged configuration rollouts, which shows that even if you can't guard against every potential failure, you can often limit its impact.  
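To make that idea concrete, here's a minimal sketch of a staged rollout loop. The batch sizes, health probe, and rollback hooks are hypothetical, not Asana's actual tooling; the point is simply that widening a change gradually gives you a chance to catch and revert a regression before it reaches every server.

```python
import time
import urllib.request

def healthy(server: str) -> bool:
    """Probe a (hypothetical) health endpoint on a server."""
    try:
        with urllib.request.urlopen(f"http://{server}/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def staged_rollout(servers, apply_config, rollback_config,
                   stages=(0.05, 0.25, 1.0), soak_seconds=300):
    """Apply a config change in widening stages, rolling back on any regression."""
    updated = 0
    for fraction in stages:
        target = max(updated, int(len(servers) * fraction))
        for server in servers[updated:target]:
            apply_config(server)
        updated = target
        time.sleep(soak_seconds)  # let problems surface before widening the blast radius
        if not all(healthy(s) for s in servers[:updated]):
            for server in servers[:updated]:
                rollback_config(server)
            raise RuntimeError("Rollout halted: health check failed after stage")
```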

Read More 

Slack Outage (February 26)

Slack’s February outage was an object lesson in not relying on individual signals. At first glance, everything looked fine at Slack—network connectivity was good, there were no latency issues, and no packet loss on paths to Slack’s infrastructure. Users could log in and browse channels, but they experienced issues with various features—including sending and receiving messages. The problems lasted for nine hours. 

Had investigations focused solely on network connectivity or on the HTTP 500 error messages that were returned, engineers might have been led down the wrong path. Only by combining these signals could one tell that this was likely a problem with the database routing layer, something later confirmed by Slack. 
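As a rough illustration of that principle, the sketch below combines two independent signals: whether a TCP connection can be established at all, and what HTTP status the service returns. The host, URL, and classification rules are placeholders for whatever your own monitoring checks, not Slack's endpoints.

```python
import socket
import urllib.error
import urllib.request

def tcp_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Network-layer signal: can we even complete a TCP handshake?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_status(url: str, timeout: float = 5.0) -> int:
    """Application-layer signal: what does the service say when asked?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except OSError:
        return 0          # no HTTP response at all

def diagnose(host: str, url: str) -> str:
    if not tcp_reachable(host):
        return "connectivity problem: look at the network path or edge"
    status = http_status(url)
    if status >= 500:
        return "network fine but 5xx errors: look behind the front end (databases, routing layers)"
    if status == 0:
        return "TCP works but HTTP doesn't: keep correlating signals"
    return "no obvious fault from these two signals alone"

print(diagnose("example.com", "https://example.com/api/health"))
```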

Explore This Outage in ThousandEyes | Read More 

X Outage (March 10)

This incident was another where it was vital to combine multiple signals to accurately diagnose the cause. ThousandEyes detected significant packet loss and connection errors at the TCP handshaking phase, meaning traffic was being dropped before a session could even be established.   

To users, the platform appeared to be “down,” with symptoms similar to those of a denial-of-service attack. However, there didn’t appear to be any visible BGP route changes or advertisements related to the X domain, which would typically occur as part of denial-of-service mitigation. It was a network-level failure, but not what it may have first appeared.   
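A quick way to see that kind of failure from the outside is to measure how often a plain TCP handshake succeeds, before any HTTP request is even made. This is a simplified probe with a placeholder target, not how ThousandEyes measures it.

```python
import socket
import time

def handshake_success_rate(host: str, port: int = 443,
                           attempts: int = 20, timeout: float = 3.0) -> float:
    """Attempt repeated TCP connections and report the fraction that complete."""
    successes = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError:
            pass  # dropped or refused before a session could be established
        time.sleep(0.5)
    return successes / attempts

rate = handshake_success_rate("example.com")
print(f"{rate:.0%} of handshakes completed")  # a low rate means traffic dies before HTTP even starts
```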

Explore This Outage in ThousandEyes | Read More 

Zoom Outage (April 16)

DNS is a common suspect when outages occur (we’ll see another example later in this round-up). In this case, Zoom’s NS records disappeared from the TLD nameservers. Although the servers themselves were healthy throughout and were answering correctly when queried directly, the DNS resolvers couldn’t find them because of the missing records.  

Consequently, all Zoom services were unavailable for around two hours. The incident highlights how failures above an organization’s DNS layer can completely knock out services, even though there are no problems with the infrastructure itself, making it vital to consider all parts of your service delivery chain.  
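The distinction matters for troubleshooting: a lookup through the normal resolver path and a query sent directly to the authoritative server can disagree when the delegation itself is broken. Here's a rough sketch using dnspython; the domain and server IP are placeholders, not Zoom's.

```python
# pip install dnspython
import dns.exception
import dns.message
import dns.query
import dns.resolver

DOMAIN = "example.com"        # placeholder for the affected domain
AUTH_SERVER = "203.0.113.10"  # placeholder IP of the domain's authoritative nameserver

# 1. The resolver path: this is what users experience. If the NS records are
#    missing from the TLD, this fails even though the authoritative servers are fine.
try:
    answer = dns.resolver.resolve(DOMAIN, "A")
    print("Resolver path OK:", [r.to_text() for r in answer])
except dns.resolver.NXDOMAIN:
    print("Resolver path: name not found (delegation may be missing)")
except dns.exception.DNSException as err:
    print("Resolver path failed:", err)

# 2. The direct path: ask the authoritative server itself.
try:
    response = dns.query.udp(dns.message.make_query(DOMAIN, "A"), AUTH_SERVER, timeout=5)
    print("Direct query answered with", len(response.answer), "record set(s)")
except (dns.exception.DNSException, OSError) as err:
    print("Direct query failed:", err)
```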

Explore This Outage in ThousandEyes | Read Analysis 

Spotify Outage (April 16)

Spotify’s April outage was another of those incidents where everything looked fine on first inspection, but the backend pipeline was broken. The vital signs were all good: connectivity, DNS, and CDN all looked healthy, and the app’s front end loaded as normal. But tracks and videos refused to play.  

The combination of server-side errors with intact network connectivity and successful static content delivery was indicative of backend service issues. The incident illustrated how server-side failures can quietly cripple core functionality while giving the appearance that everything is working normally.   
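One simple way to spot that pattern is to compare a static, CDN-served asset against a dynamic API call for the same service. The URLs below are placeholders; the comparison, not the endpoints, is the point.

```python
import urllib.error
import urllib.request

STATIC_URL = "https://cdn.example.com/app/main.js"  # placeholder static asset
API_URL = "https://api.example.com/v1/tracks/123"   # placeholder dynamic endpoint

def status(url: str):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return None

static_status, api_status = status(STATIC_URL), status(API_URL)
if static_status == 200 and api_status and api_status >= 500:
    # Delivery path and CDN are fine; the failure sits in the services
    # the API calls behind the scenes.
    print("Static content healthy, API returning 5xx: suspect backend services")
else:
    print("Static:", static_status, "API:", api_status)
```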

Explore This Outage in ThousandEyes  

Google Cloud Incident (June 12)

Google’s June incident is a reminder to trace a fault all the way back to its source. The incident was triggered by an invalid automated update that disrupted the company’s identity and access management (IAM) system. 

This meant users couldn’t use Google to authenticate on third-party apps such as Spotify and Fitbit. However, the outage had other knock-on consequences. Cloudflare’s Workers KV, which provides authentication tokens and configuration data for other Cloudflare services, was also relying on Google Cloud for backend storage in some regions. That in turn broke services that depended on Cloudflare.   

So what you had was a three-tier cascade: Google’s failure led to Cloudflare problems, which affected downstream applications relying on Cloudflare—even if they weren’t Google customers themselves.  
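A small dependency graph makes the cascade easier to reason about. The edges below are a deliberately simplified reading of the incident (your own map will look different), but walking the graph shows why an outage can reach customers who never touch the failed provider directly.

```python
# Each service maps to the services it depends on (simplified, illustrative).
DEPENDS_ON = {
    "google-sign-in": ["google-cloud"],
    "spotify-login": ["google-sign-in"],
    "cloudflare-workers-kv": ["google-cloud"],      # backend storage in some regions
    "cloudflare-service": ["cloudflare-workers-kv"],
    "downstream-app": ["cloudflare-service"],       # not a Google customer at all
}

def blast_radius(failed: str) -> set[str]:
    """Return every service that directly or transitively depends on the failure."""
    impacted, changed = set(), True
    while changed:
        changed = False
        for service, deps in DEPENDS_ON.items():
            if service not in impacted and (failed in deps or impacted & set(deps)):
                impacted.add(service)
                changed = True
    return impacted

print(sorted(blast_radius("google-cloud")))
# ['cloudflare-service', 'cloudflare-workers-kv', 'downstream-app', 'google-sign-in', 'spotify-login']
```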

Read Analysis 

Cloudflare Outage (July 14)

BGP route withdrawals were at the heart of Cloudflare’s DNS problems in July. A configuration error introduced weeks before the outage was triggered by an unrelated change, prompting Cloudflare’s BGP route announcements to vanish from the global Internet routing table.    

With no valid routes, traffic couldn’t reach Cloudflare’s 1.1.1.1 DNS resolver, meaning users couldn’t reach the numerous websites and apps that rely on it. While the problem was resolved within an hour or so, it highlights how flaws in configuration updates don’t always trigger an immediate crisis, instead storing up problems for later.  
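From the outside, the tell was that the resolver itself stopped answering while everything else kept working. A quick check along those lines, using dnspython, is to query the same name through 1.1.1.1 and through another public resolver and compare; the test name is arbitrary.

```python
# pip install dnspython
import dns.exception
import dns.message
import dns.query

def resolver_responds(resolver_ip: str, name: str = "example.com") -> bool:
    """Send one UDP query to a specific resolver and see whether anything comes back."""
    try:
        dns.query.udp(dns.message.make_query(name, "A"), resolver_ip, timeout=3)
        return True
    except (dns.exception.DNSException, OSError):
        return False

for ip in ("1.1.1.1", "8.8.8.8"):
    print(ip, "responding" if resolver_responds(ip) else "not responding")
# If 1.1.1.1 times out while another resolver answers, the problem is reaching
# that resolver (for example, withdrawn routes), not DNS in general.
```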

Explore This Outage in ThousandEyes | Read Analysis 

Commonwealth Bank Outage (October 2)

The value of combining multiple diagnostic observations was exemplified by the Commonwealth Bank outage in Australia. Not only did the bank’s mobile app go down for around two hours, but its website and even its ATMs failed at the same time.    

The fact that three different channels, built on three different frontend technologies, failed all at once effectively rules out app- or UI-level issues. The timing and behavior pointed squarely at a shared backend dependency, which proved to be the case. This incident showed how a single failure can instantly disable every customer touchpoint, and why it’s vital to check all signals before reaching for remedies.   
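The reasoning can be captured in a very small check: probe each customer-facing channel independently and see whether they fail together. The endpoints below are hypothetical stand-ins, not Commonwealth Bank URLs.

```python
import urllib.error
import urllib.request

CHANNELS = {
    "website": "https://www.example-bank.com/",
    "mobile-api": "https://api.example-bank.com/v1/ping",
    "atm-gateway": "https://atm-gw.example-bank.com/health",
}

def channel_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500  # a 4xx still means the channel answered
    except OSError:
        return False

results = {name: channel_up(url) for name, url in CHANNELS.items()}
if not any(results.values()):
    # Independent front ends rarely all break at the same moment; simultaneous
    # failure across every channel points at something they share.
    print("All channels down together: suspect a shared backend dependency")
else:
    print("Per-channel status:", results)
```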

Read More  

Azure Outages (October 9 & 29)

Microsoft Azure Front Door’s first October outage was caused by software defects that crashed edge sites in the EMEA region. Traffic through EMEA slowed or failed, while the Americas and other regions were largely unaffected. 

The second incident on October 29 was triggered by a configuration change, and this time the impact was worldwide, with customers seeing HTTP 503 errors and connection timeouts.  

Together, these two outages illustrate an important distinction: infrastructure failures tend to be regional with only certain customers affected, whereas configuration errors typically hit all regions simultaneously.   
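That distinction is something you can check for mechanically if you have probes in multiple regions. The sketch below simply classifies results by region; the data shape is hypothetical and stands in for whatever your monitoring agents report.

```python
from collections import defaultdict

def classify_scope(probe_results):
    """probe_results: iterable of (region, success) pairs from distributed vantage points."""
    per_region = defaultdict(list)
    for region, success in probe_results:
        per_region[region].append(success)
    failing = {region for region, outcomes in per_region.items()
               if sum(outcomes) / len(outcomes) < 0.5}
    if not failing:
        return "healthy everywhere"
    if failing == set(per_region):
        return "global impact: a configuration or control-plane change is a strong suspect"
    return f"regional impact ({', '.join(sorted(failing))}): suspect infrastructure in those regions"

probes = [("emea", False), ("emea", False), ("amer", True), ("apac", True)]
print(classify_scope(probes))  # regional impact (emea): ...
```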

Explore The October 9 Outage in ThousandEyes | Read October 29 Outage Analysis 

AWS DynamoDB Outage (October 20)

That said, regional failures can still have a global impact. The October failure of AWS DynamoDB originated in the US-EAST-1 region, but global services such as IAM and DynamoDB Global Tables depended on that regional endpoint, meaning the outage propagated worldwide.    

Major customers such as Slack, Atlassian, and Snapchat experienced long disruptions, some lasting for over 15 hours. The incident highlights how a failure in a single, centralized service can ripple outwards through dependency chains that aren’t always obvious from architecture diagrams.  
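For teams using DynamoDB Global Tables, one mitigation is being able to serve reads from another replica region when the usual endpoint is failing. Below is a rough boto3 sketch under that assumption; the table name, key, and region list are placeholders, and it only helps for dependencies that aren't themselves anchored to a single region (as IAM was in this incident).

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # assumes the Global Table is replicated to both

def get_item_with_failover(table_name: str, key: dict):
    """Try each replica region in turn and return the first successful read."""
    last_error = None
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table(table_name)
        try:
            return table.get_item(Key=key).get("Item"), region
        except (BotoCoreError, ClientError) as err:
            last_error = err  # endpoint unreachable or erroring; try the next replica
    raise last_error

item, served_from = get_item_with_failover("orders", {"order_id": "12345"})  # placeholder table/key
print(f"Read served from {served_from}: {item}")
```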

Explore This Outage in ThousandEyes | Read Analysis 

Cloudflare Outage (November 18)

Finally, Cloudflare’s November incident revealed how a distributed edge combined with staggered configuration updates can create intermittent issues.  

The incident was triggered by a bad configuration file in Cloudflare’s Bot Management system that exceeded a hard-coded limit. When proxies tried to load the oversized file they fell over, but because the proxies refreshed configurations on staggered five-minute cycles, we didn’t see a lights-on/lights-off outage but intermittent, global instability. The fix was equally sporadic, as proxies loaded the corrected configuration according to their refresh cycles.  
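One of the general lessons is to validate generated configuration against known limits before it is loaded, and to fall back to the last known-good version rather than crashing. Here's a minimal sketch of that idea; the limit, file format, and function names are hypothetical, not Cloudflare's code.

```python
import json
from pathlib import Path

MAX_FEATURES = 200  # hypothetical hard limit, analogous to the proxy's cap
_last_good = None   # most recent configuration that passed validation

def load_features(path: str):
    """Validate a new feature file before swapping it in; on failure, keep
    serving the last known-good configuration instead of crashing."""
    global _last_good
    try:
        features = json.loads(Path(path).read_text())
        if len(features) > MAX_FEATURES:
            raise ValueError(f"{len(features)} features exceeds limit of {MAX_FEATURES}")
        _last_good = features
    except (OSError, ValueError) as err:  # JSONDecodeError is a ValueError
        print(f"Rejected new configuration ({err}); keeping last known-good version")
    return _last_good
```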

Explore This Outage in ThousandEyes | Read Analysis 

Key Takeaways from 2025

It's important to remember that single symptoms can be misleading; often, the real story emerges from combinations of signals. For instance, if the network seems healthy but users are experiencing issues, the problem might be in the backend. Simultaneous failures across channels can point to shared dependencies, while intermittent failures could indicate rollout or edge problems. Monitoring across all layers and understanding these patterns—timing, dependencies, and scope—helps narrow down possible causes and leads to faster resolution.
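As a memory aid, the symptom combinations from this year's incidents can be boiled down to a small lookup table. It's deliberately simplistic, and the patterns below are a summary of the cases above rather than an exhaustive diagnostic.

```python
PATTERNS = {
    ("network ok", "5xx errors"): "look behind the front end: backend services, databases, routing layers",
    ("network ok", "all channels down"): "suspect a shared backend dependency",
    ("handshakes failing", "no BGP changes"): "edge or transport failure rather than DoS mitigation",
    ("intermittent", "global"): "suspect a staggered rollout or edge configuration refresh",
}

def first_hypothesis(*observations: str) -> str:
    return PATTERNS.get(tuple(observations), "no single match: keep correlating signals across layers")

print(first_hypothesis("network ok", "5xx errors"))
```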

The complexity of modern systems, especially distributed ones, means it’s unrealistic to prevent every possible issue through testing alone. Instead, focus on building rapid detection and response capabilities, using techniques like staged rollouts and clear communication with stakeholders when incidents arise. The goal is to minimize the time between problem detection and recovery, maintaining trust and ensuring smoother operations even amid inevitable complexity.

Understanding your system’s architecture is critical for effective incident response. Different architectures have distinct failure signatures. For example, centralized systems tend to fail in predictable, cascading ways with wide-ranging impact, while distributed systems may localize failures but still face global effects from configuration changes. By familiarizing yourself with your system’s specific failure patterns before an incident, you can more quickly interpret symptoms and target your investigation during an outage. This knowledge not only speeds up detection but also helps you plan mitigation strategies in advance.


More Outage Insights

Stay updated throughout the year on Internet health and outage news by subscribing to The Internet Report podcast on Apple Podcasts, Spotify, SoundCloud, or wherever you get your podcasts. 

To experience how ThousandEyes can help you improve digital resilience and ensure business continuity in the face of disruption, start your free trial today. 

