This is The Internet Report, where we analyze outages and trends across the Internet through the lens of Cisco ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
Systems are complex. Predicting how a change to one part of a distributed system will affect other areas can be difficult. This became especially evident on November 18, when cloud and CDN provider Cloudflare experienced an outage that impacted X, OpenAI, Anthropic, and many other services. Cloudflare reported that this outage was caused by a Bot Management configuration file that doubled in size due to a database query error. When the file exceeded a hard-coded limit, the traffic routing software crashed.
The Cloudflare incident offered several important takeaways for enterprise network operations, particularly around failure diagnosis, mitigation timing, and understanding architectural trade-offs.
Read on to learn more or use the links below to jump to the sections that most interest you:
- Modern Internet Resilience and Interdependence
- Scale and Cascading Impact
- Lessons for NetOps Teams
- By the Numbers
*Note: Cloudflare experienced another outage a few weeks later, on December 5. For more insights on that outage, see this analysis.
Modern Internet Resilience and Interdependence
The outage stemmed from a flaw in Cloudflare’s Bot Management feature. When requests arrived, the Bot Management module was supposed to evaluate and score them to distinguish human visitors from bots. However, a database permissions change caused the system to generate oversized configuration files—over 200 features instead of the usual 60—exceeding a hard limit in Cloudflare’s proxies. When proxies tried to load these files during their regular five-minute refresh, the module crashed, leading to HTTP 500 errors for affected requests.
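To make the failure mode concrete, here is a minimal, hypothetical sketch of how a hard-coded feature cap in a configuration loader can turn an oversized file into a crash rather than a graceful fallback. This is not Cloudflare’s code; the function and exception names are invented, and the limit and feature counts simply mirror the figures described above.

```python
# Hypothetical illustration of a hard feature-count limit in a config loader.
# NOT Cloudflare's code; the cap (200) and normal size (~60) echo the figures
# described in the incident write-up.

MAX_FEATURES = 200  # hard limit sized for "normal" files, with headroom


class FeatureFileTooLarge(Exception):
    """Raised when a refreshed feature file exceeds the preallocated limit."""


def load_feature_file(lines: list[str]) -> list[str]:
    """Parse a bot-management feature file, enforcing the hard cap.

    In the real incident, exceeding the cap crashed the proxy module; here
    the failure surfaces as an unhandled exception, which a caller could
    translate into HTTP 500s for every request it serves.
    """
    features = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if len(features) >= MAX_FEATURES:
            # A query bug that doubles the file trips this branch even though
            # every individual entry is still valid.
            raise FeatureFileTooLarge(
                f"feature file has more than {MAX_FEATURES} entries"
            )
        features.append(line)
    return features


if __name__ == "__main__":
    normal = [f"feature_{i}" for i in range(60)]
    print(len(load_feature_file(normal)), "features loaded")  # works fine

    doubled = [f"feature_{i}" for i in range(60)] * 4  # ~240 entries
    try:
        load_feature_file(doubled)
    except FeatureFileTooLarge as exc:
        print("refresh failed:", exc)  # every request now errors out
```

A more forgiving pattern, in general, is to reject the oversized file and keep serving the last known-good configuration, so a bad refresh degrades bot scoring rather than taking down request handling.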
Crucially, the failures weren’t uniform. Because proxies refreshed on staggered schedules, some continued operating with old files while others crashed on new ones, resulting in fluctuating availability. The telltale sign: challenge components were missing entirely from HTTP responses, showing that the Bot Management module failed before it could even process requests.
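That kind of signal can be checked programmatically. The sketch below, assuming the Python `requests` library and purely illustrative marker strings and URLs, shows one way a monitoring probe might separate a bare 5xx (suggesting the edge module failed before processing the request) from a 5xx that still carries challenge markup.

```python
# Minimal monitoring sketch: classify errors by checking whether HTTP 500s
# come back without the usual challenge markup, which points at the
# provider's edge module rather than the origin.
# Assumes the `requests` library; marker strings and URLs are hypothetical.

import requests

CHALLENGE_MARKERS = ("cf-challenge", "challenge-platform")  # illustrative only


def classify_response(url: str) -> str:
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException as exc:
        return f"network error: {exc}"

    if resp.status_code < 500:
        return f"ok ({resp.status_code})"

    # A 5xx that still carries challenge markup suggests the edge module ran;
    # a bare 5xx with no markup suggests it failed before processing requests.
    if any(marker in resp.text for marker in CHALLENGE_MARKERS):
        return "5xx with challenge markup: likely origin or downstream issue"
    return "bare 5xx, no challenge markup: likely edge/module failure"


if __name__ == "__main__":
    for target in ("https://app.example.com", "https://api.example.com"):
        print(target, "->", classify_response(target))
```

Run against a handful of affected endpoints over time, a probe like this would also surface the fluctuating availability caused by the staggered refresh schedule.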
This incident underscored how deeply interdependent today’s Internet services are. A single configuration error within a major service provider’s infrastructure can have immediate, global repercussions across a diverse array of platforms—demonstrating both the resilience and the fragility of distributed, cloud-centric architectures.
Explore the Cloudflare outage further in the ThousandEyes platform (no login required).
Scale and Cascading Impact
The failure was rapid and widespread. Because so many domains depend on Cloudflare, the impact cascaded across downstream services. For some users, email, project management, and CRM tools all went down at once. This pattern—simultaneous failures in otherwise unrelated services—highlighted shared infrastructure dependencies and helped organizations quickly identify a vendor-level issue.
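One quick way to test the “shared vendor” hypothesis is to look at where the failing services’ DNS points. The rough sketch below assumes the dnspython package (`pip install dnspython`); the domain names are placeholders, and the CNAME target is only a crude hint of the provider behind each service.

```python
# Rough sketch: when several unrelated services fail at once, resolve their
# DNS and look for a shared provider. Assumes dnspython (`pip install
# dnspython`); domains are placeholders.

from collections import defaultdict

import dns.exception
import dns.resolver

FAILING_DOMAINS = ["crm.example.com", "mail.example.net", "tasks.example.org"]


def provider_hint(domain: str) -> str:
    """Return the CNAME target as a crude provider hint, if one exists."""
    try:
        answers = dns.resolver.resolve(domain, "CNAME")
        return str(answers[0].target).rstrip(".")
    except dns.resolver.NoAnswer:
        return "no CNAME (apex record or direct A/AAAA)"
    except dns.exception.DNSException as exc:
        return f"lookup failed: {exc}"


if __name__ == "__main__":
    grouped = defaultdict(list)
    for domain in FAILING_DOMAINS:
        grouped[provider_hint(domain)].append(domain)

    for hint, domains in grouped.items():
        print(f"{hint}: {', '.join(domains)}")
    # If most failing services map onto the same CNAME suffix, the incident
    # is probably vendor-level rather than a local network problem.
```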
While it may feel that outages are becoming more frequent, the real trend is increasing impact due to the aggregation of services on shared platforms and the layering of dependencies. When a widely used platform like Cloudflare has an issue, it ripples across the ecosystem, affecting a multitude of downstream services and end users.
Organizations reacted in different ways. Some rerouted DNS records to bypass Cloudflare and serve content directly from their own infrastructure, restoring availability at the expense of losing Cloudflare’s additional services. Others waited for Cloudflare to resolve the issue. The first widespread reroutes coincided with Cloudflare’s public status update, suggesting many organizations waited for confirmation before acting. Some appeared to switch back soon after Cloudflare’s fix, while others seemed to wait hours or even a full day—likely to help ensure stability.
While some organizations rerouted traffic away from Cloudflare via DNS failover, this is not a trivial decision. Switching away from Cloudflare meant losing edge caching. Instead of serving content from Cloudflare’s 300+ global edge locations, organizations delivered directly from their own data centers or cloud regions—of which a single enterprise often has just a handful globally. These users likely saw latency increases, as ThousandEyes observed at a number of organizations during the outage. When services returned to Cloudflare, ThousandEyes observed latency returning to baseline, illustrating the geographic benefits of edge networks.
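Quantifying that trade-off ahead of time is straightforward. The sketch below, with placeholder hostnames and the Python `requests` library assumed, compares response times against a CDN-fronted hostname and a hypothetical direct-to-origin hostname; run from several geographies, the gap shows roughly what a DNS failover would cost your more distant users.

```python
# Quick-and-dirty latency comparison between a CDN-fronted hostname and a
# direct origin hostname, to quantify the failover trade-off before an
# incident forces the decision. Hostnames are placeholders; assumes `requests`.

import statistics
import time

import requests

EDGE_URL = "https://www.example.com/"        # served via the CDN edge
ORIGIN_URL = "https://origin.example.com/"   # bypasses the CDN (hypothetical)


def sample_latency(url: str, attempts: int = 5) -> float:
    """Return the median wall-clock response time in milliseconds."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


if __name__ == "__main__":
    edge_ms = sample_latency(EDGE_URL)
    origin_ms = sample_latency(ORIGIN_URL)
    print(f"edge:   {edge_ms:.0f} ms")
    print(f"origin: {origin_ms:.0f} ms")
    print(f"failover penalty: {origin_ms - edge_ms:+.0f} ms per request")
```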
Lessons for NetOps Teams
Cloudflare’s Bot Management outage disrupted a myriad of services and likely led organizations to make quick, sometimes difficult decisions. While most services were restored within hours, the incident underscores the importance of understanding infrastructure dependencies, having pre-defined response plans, and regularly testing failover processes so that you can minimize the impact on your users if a third-party provider you rely on experiences an outage. The choice to act—or not—depends on business impact, available capabilities, and confidence in the root cause and resolution timeline. Performance and resiliency trade-offs should be understood and planned for—not discovered during an incident.
This outage highlighted several key points:
- Rapid diagnosis matters: Recognizing patterns of simultaneous failures across unrelated services can quickly reveal vendor outages versus local issues.
- Decision timing is critical: Organizations need clear criteria for when to switch away from a vendor, balancing operational availability against the loss of vendor services and the effort required (the monitoring sketch after this list illustrates one simple trigger).
- Test failover procedures: Regularly validate that alternative infrastructure and configurations actually work; switching during an outage is much harder without preparation.
- Understand architectural trade-offs: Edge caching delivers global low-latency performance, while origin delivery—even if redundant—can’t match the reach or performance for international users.
- Feature-level awareness: Not all vendor failures are total; knowing which features your services depend on helps in targeted mitigation.
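As referenced above, a simple way to put numbers behind the “when to switch” decision is a lightweight health check that watches for sustained 5xx rates and alerts a human rather than failing over automatically. The endpoint, threshold, and window below are illustrative, and the Python `requests` library is assumed.

```python
# Sketch of a periodic vendor health check that could feed a pre-defined
# failover runbook: a sustained 5xx rate above a threshold triggers an alert
# rather than an automatic switch. URL, threshold, and window are
# illustrative; assumes the `requests` library.

import time

import requests

CHECK_URL = "https://www.example.com/healthz"  # hypothetical endpoint
ERROR_THRESHOLD = 0.5    # fraction of failed probes that triggers the alert
WINDOW = 10              # number of probes per evaluation window
INTERVAL_SECONDS = 30


def probe(url: str) -> bool:
    """Return True if the URL responds without a 5xx within the timeout."""
    try:
        return requests.get(url, timeout=5).status_code < 500
    except requests.RequestException:
        return False


def watch(url: str) -> None:
    results = []
    while True:
        results.append(probe(url))
        results = results[-WINDOW:]  # keep a sliding window of recent probes
        error_rate = 1 - (sum(results) / len(results))
        if len(results) == WINDOW and error_rate >= ERROR_THRESHOLD:
            # In practice this is where you would page on-call and point
            # them at the documented DNS failover procedure.
            print(f"ALERT: {url} error rate {error_rate:.0%}; "
                  "consider executing the failover runbook")
        time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    watch(CHECK_URL)
```

Alerting instead of auto-switching keeps the final call with the team that understands the business impact and the vendor’s likely resolution timeline, which is exactly the judgment the points above describe.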
For more insights on the Cloudflare outage, listen to the podcast and read our full outage analysis.
By the Numbers
Let’s close by taking our usual look at some of the global trends that ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (November 17-30).
Global Outages
- From November 17-23, ThousandEyes observed 91 global outages, representing a 29% decrease from 128 the prior week (November 10-16).
- During the week of November 24-30, global outages increased 27%, rising to 116.
U.S. Outages
- The United States saw outages decrease to 31 during the week of November 17-23, representing a 30% decrease from the previous week's 44.
- During the week of November 24-30, U.S. outages increased 65%, rising to 51.
- Over the two-week period from November 17-30, the United States accounted for 40% of all observed network outages.
Month-over-month Trends
- Global network outages decreased 40% from October to November 2025, dropping from 701 incidents to 421.
- The United States showed a more pronounced 62% decrease, with outages falling from 404 in October to 153 in November. This reduction coincides with the Thanksgiving holiday period in the U.S. Though outages did rise during the week of Thanksgiving itself, weekly totals remained relatively low. The overall decrease in total outages during November is likely due to a slowdown in maintenance activity and other change work during a holiday month.
- In November, the United States accounted for 36% of all observed network outages, compared to 58% in October.