
Cloudflare Outage Analysis: November 18, 2025

By Internet Research Team | 22 min read

Summary

On November 18, Cisco ThousandEyes observed a global outage affecting cloud and CDN provider Cloudflare that impacted multiple Internet services. See how the outage unfolded in this analysis.


ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis is based on our extensive monitoring, as well as ThousandEyes’ global outage detection service, Internet Insights.

Outage Analysis

Updated December 9, 2025 

On November 18, starting at approximately 11:30 (UTC), services using Cloudflare began experiencing failures. Within minutes, a significant percentage of monitored Cloudflare-dependent services were returning errors. 

The failures manifested as HTTP 500 responses, indicating server-side processing failures rather than network connectivity issues. Investigation revealed these failures occurred specifically when Cloudflare's bot management feature attempted to evaluate requests. Cloudflare's post-incident review identified the root cause as a component-level failure in its bot management service. 

ThousandEyes screenshot showing HTTP 500 server errors indicating backend service issues.
Figure 1. HTTP 500 server errors indicating backend service issues 

Organizations responded differently. Some executed DNS failover to bypass Cloudflare and serve directly from their own infrastructure, accepting the trade-off of restored availability against loss of Cloudflare's services. Performance data from these DNS changes revealed measurable differences between edge-cached and origin-based delivery, with impact varying by geography. 

This analysis examines what broke in Cloudflare's bot management system, how failures manifested, mitigation decisions organizations faced, and the performance implications of moving between architectural models. 

The Scale of Impact

Around 11:30 (UTC), ThousandEyes began observing a sharp increase in HTTP 500 internal server errors for monitored services dependent on Cloudflare. 

Graph showing percentage of monitored Cloudflare-dependent services experiencing HTTP 5XX errors over time
Figure 2. Percentage of monitored Cloudflare-dependent services experiencing HTTP 5XX errors over time 

The spike was rapid and sustained. Within minutes, a significant percentage of monitored services using Cloudflare were affected, with failures persisting through early afternoon UTC. 

What made this scale significant was its cascading impact. Organizations use infrastructure providers for multiple functions: content delivery, DDoS protection, bot management, and DNS resolution. When a provider experiences an outage, the impact radiates through layers of dependent services. An end user might experience their email service, project management tool, and CRM all failing simultaneously—three seemingly unrelated services with one underlying cause. The common dependency becomes visible only during failures. 

Understanding Bot Management: The Component That Failed

Not every service using Cloudflare experienced the outage. Failures appeared to be concentrated among services using Cloudflare's bot management feature. Understanding what bot management is and how it operates helps explain the failure patterns we observed. 

What Bot Management Does 

Cloudflare's bot management evaluates incoming requests to distinguish automated traffic from human visitors. When a request arrives at a Cloudflare edge location, the core proxy loads a feature file containing detection rules and parameters. These feed into multiple detection engines—machine learning models, heuristic checks, browser fingerprinting, and anomaly detection—that produce a score indicating the likelihood the request is automated. 

Based on these scores, Cloudflare determines how to handle each request: allowing it through, presenting a verification challenge, or rejecting it. This evaluation can happen on requests before they reach the customer's origin servers. The feature file refreshes every five minutes from a central database because bot behavior evolves constantly—credential stuffing campaigns, content scrapers, and DDoS botnets modify their signatures to avoid detection. 
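As a rough illustration of this evaluation flow, the sketch below maps a combined bot score to an allow, challenge, or block action. The thresholds, signal names, and 0-1 scoring scale are hypothetical, not Cloudflare's actual values.

```python
# Illustrative sketch of the evaluation flow: several detection engines
# contribute signals, a combined score is produced, and the score maps to an
# action. Thresholds, names, and the 0-1 scale are hypothetical.

def score_request(signals: dict) -> float:
    """Average the available detection signals into one score.
    In this sketch, higher means more likely human."""
    engines = ("ml_model", "heuristics", "fingerprint", "anomaly")
    return sum(signals.get(engine, 0.5) for engine in engines) / len(engines)

def decide(score: float) -> str:
    """Map the score to a handling decision for the request."""
    if score >= 0.8:
        return "allow"       # confident the client is human
    if score >= 0.3:
        return "challenge"   # ambiguous: respond with the challenge page (HTTP 403 plus challenge assets)
    return "block"           # confident the client is automated

print(decide(score_request({"ml_model": 0.2, "heuristics": 0.4})))  # -> "challenge"
```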

What We Observed 

Services using bot management showed a specific failure pattern that revealed exactly where the system broke. 

During normal operation: 

ThousandEyes screenshot showing response pattern of bot management operating normally
Figure 3. Response pattern of bot management operating normally 

In normal operation, bot management evaluates incoming requests and applies its configured policies. In this case, the request to the index page (/) was evaluated and bot management determined a challenge was required. The HTTP 403 Forbidden response indicates access is denied until the challenge is completed. The response body contains Cloudflare's challenge page with the infrastructure needed to present that challenge: human-challenge.css, human-challenge.js, captcha.js, and related verification components. 

During the outage: 

ThousandEyes screenshot showing HTTP 500 response with no challenge components present observed during outage
Figure 4. HTTP 500 response with no challenge components present observed during outage 

The same request to the index page (/) returned a 500 error. The challenge components are completely absent. 

Challenge components are loaded by the core proxy only when it successfully processes the bot management configuration. Their complete absence is consistent with Cloudflare's report that proxies crashed when the oversized feature file exceeded a hard-coded limit. 

What Caused the Crash

According to Cloudflare's post-incident review, a database permissions change caused the system generating the bot management feature file to return duplicate rows. The file grew from approximately 60 features to more than 200—exceeding the core proxy's hard-coded limit. 

When proxies attempted to load this oversized file during their five-minute refresh cycle, the bot management module within the proxy failed. 

In waterfall analysis, we observed a complete absence of challenge infrastructure in responses. When bot management fails to initialize, the proxy can't load the components needed to present challenges—they never appear in waterfalls. This diagnostic signature—expected components completely missing rather than partially loaded or malformed—indicates failure during initialization rather than during processing. 

Services returned HTTP 500 responses, meaning requests were reaching Cloudflare’s edge infrastructure and the proxy was operational enough to receive those requests and attempt to process them, but the bot management component within the proxy couldn't complete its evaluation.
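A minimal sketch of the waterfall check behind this diagnostic signature: given a page's status code and the objects fetched for it, flag loads that match either the normal challenge pattern or the failure pattern. The asset names are those visible in Figure 3; the data shape and function are illustrative.

```python
# Sketch of the waterfall check behind this diagnostic signature. Asset names
# are from Figure 3; the data shape is an assumption.

CHALLENGE_ASSETS = ("human-challenge.css", "human-challenge.js", "captcha.js")

def classify_page_load(status_code: int, fetched_urls: list) -> str:
    """Compare a page load against the patterns observed before and during the outage."""
    challenge_present = any(
        asset in url for url in fetched_urls for asset in CHALLENGE_ASSETS
    )
    if status_code == 403 and challenge_present:
        return "normal: bot management presented its challenge"
    if status_code == 500 and not challenge_present:
        return "outage pattern: bot management failed before initializing the challenge"
    return "other"

print(classify_page_load(403, ["/human-challenge.js", "/captcha.js", "/human-challenge.css"]))
print(classify_page_load(500, ["/"]))
```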

Why the Outage Did Not Appear Uniform

During the outage period, we saw services fluctuating: some requests appeared to succeed while others failed with HTTP 500 internal server errors. The pattern changed over time: the same test to the same service might succeed at 11:35 (UTC), fail at 11:40, then succeed again at 11:45. 

ThousandEyes screenshot showing availability appeared to fluctuate throughout the outage period.
Figure 5. Availability appeared to fluctuate throughout the outage period 

This fluctuating pattern makes sense given the crash mechanism. Every five minutes, proxies across Cloudflare's global network refreshed their feature files. At any given moment, some proxies had successfully loaded properly sized files and were operational, while others had crashed while attempting to load oversized files. As proxies continued their refresh cycles, the pattern shifted. The fix required stopping automatic file generation entirely, manually deploying a known-good version, and restarting crashed proxy instances across hundreds of locations globally. 
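A toy simulation makes the fluctuation intuitive: each proxy refreshes independently every five minutes, and whether it stays healthy depends on whether the file it happened to pull was oversized. The proxy count and probability below are arbitrary illustrative values, not measurements.

```python
# Toy simulation of the fluctuation: each proxy independently refreshes its
# feature file every five minutes, and the file it pulls is oversized with
# some probability while the faulty generation pipeline is active.

import random

MAX_FEATURES = 200

def refresh_cycle(num_proxies: int, p_oversized: float) -> float:
    """Fraction of proxies that remain healthy after one refresh cycle."""
    healthy = sum(
        1 for _ in range(num_proxies)
        if (240 if random.random() < p_oversized else 60) <= MAX_FEATURES
    )
    return healthy / num_proxies

random.seed(1)
for cycle in range(5):   # five consecutive five-minute refresh cycles
    share = refresh_cycle(num_proxies=300, p_oversized=0.5)
    print(f"cycle {cycle}: ~{share:.0%} of proxies serving successfully")
```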

How Organizations Responded

During the outage, some organizations executed DNS failover away from Cloudflare while others waited for Cloudflare to resolve the issue. Understanding how these DNS changes happened—and the patterns in which they occurred—reveals how organizations made real-time decisions during a major infrastructure failure. 

What Changed 

We observed fundamental shifts in where traffic was going. Tests that had been reaching IP addresses in Cloudflare's autonomous system (AS 13335) suddenly began reaching IP addresses in completely different autonomous systems. 

ThousandEyes screenshot showing path visualization showing traffic routing to Cloudflare AS 13335 (before DNS change)

ThousandEyes screenshot showing path visualization showing traffic routing to Microsoft AS 8075 (after DNS failover)
Figure 6. Path visualization showing traffic routing to Cloudflare AS 13335 (before DNS change) and to Microsoft AS 8075 (after DNS failover) 

The network paths changed entirely. Instead of routing through transit providers to Cloudflare's infrastructure, traffic routed to different destinations. The IP addresses responding to requests were different. The autonomous systems were different. 

These path changes provide visibility into infrastructure dependencies that aren't obvious during normal operations. When services work, users rarely know whether they are reaching Cloudflare or origin infrastructure directly—the domain name stays the same regardless. Path-level monitoring reveals the actual infrastructure serving requests, making vendor dependencies visible. This visibility becomes critical during outages when determining whose infrastructure is actually failing. 
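A simple version of this kind of dependency check can be run from the client side: resolve the service's hostname and test whether the addresses that will serve the request fall inside Cloudflare's published IP ranges. The sketch below uses only a subset of those ranges; the full, current list is published at https://www.cloudflare.com/ips/.

```python
# Client-side dependency check: do the addresses serving this hostname fall
# inside Cloudflare's published ranges? Only a subset of the IPv4 ranges is
# listed here; see https://www.cloudflare.com/ips/ for the full list.

import ipaddress
import socket

CLOUDFLARE_V4_SUBSET = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "104.16.0.0/13", "104.24.0.0/14", "172.64.0.0/13",
        "198.41.128.0/17", "162.158.0.0/15", "173.245.48.0/20",
    )
]

def served_via_cloudflare(hostname: str) -> bool:
    """Resolve the hostname and report whether any returned IPv4 address sits
    in a Cloudflare range, i.e. whether traffic still traverses the proxy."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    addresses = {info[4][0] for info in infos}
    return any(
        ipaddress.ip_address(addr) in net
        for addr in addresses
        for net in CLOUDFLARE_V4_SUBSET
    )

# Expected True while a site is proxied by Cloudflare; after a DNS failover to
# origin infrastructure it would flip to False.
print(served_via_cloudflare("www.cloudflare.com"))
```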

What DNS Failover Achieves

When destination IP addresses change for a domain name, the DNS records for that domain have changed. Organizations using Cloudflare configure their DNS records to return IP addresses belonging to Cloudflare's infrastructure. Traffic flows to those Cloudflare addresses, where Cloudflare proxies receive connections and forward them to the service's origin infrastructure. 

By changing DNS records to point to their own infrastructure, organizations executed DNS failover. Requests no longer traversed Cloudflare's proxy. Services became reachable again, though without Cloudflare services such as bot management and edge caching. Organizations accepted this trade-off: operational availability without these Cloudflare features versus continued disruption while waiting for Cloudflare to fix its issue. 
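Before cutting DNS over, teams generally want evidence that the origin can serve production traffic without the proxy in front of it. The sketch below probes a placeholder origin address directly over TLS with the real hostname; it is a rough check that assumes the origin presents a valid certificate for that hostname and accepts direct connections.

```python
# Rough pre-failover check: can the origin serve HTTPS for the production
# hostname when reached directly, bypassing DNS and the Cloudflare proxy?
# The origin IP is a placeholder; assumes a valid certificate for the hostname.

import socket
import ssl

def probe_origin(origin_ip: str, hostname: str, path: str = "/") -> int:
    """Connect straight to the origin IP with SNI/Host set to the real
    hostname and return the HTTP status code of a simple GET."""
    context = ssl.create_default_context()
    with socket.create_connection((origin_ip, 443), timeout=5) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=hostname) as tls:
            request = (
                f"GET {path} HTTP/1.1\r\n"
                f"Host: {hostname}\r\n"
                "Connection: close\r\n\r\n"
            )
            tls.sendall(request.encode())
            status_line = tls.recv(4096).split(b"\r\n", 1)[0]
            return int(status_line.split()[1])

# Example with a placeholder documentation address: a 200/30x suggests the
# origin can take direct traffic, while certificate errors or timeouts suggest
# it is not ready for failover.
# print(probe_origin("192.0.2.10", "www.example.com"))
```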

When Organizations Executed DNS Failover 

Graph showing AS path changes away from Cloudflare AS appeared to increase following Cloudflare's 11:48 (UTC) status update
Figure 7. AS path changes away from Cloudflare AS appeared to increase following Cloudflare's 11:48 (UTC) status update 

The first DNS changes appeared around 11:48 (UTC), coinciding with Cloudflare publishing its status update acknowledging the issue. Organizations had been experiencing failures for approximately 30 minutes at that point. The timing suggests organizations were waiting for confirmation before acting: once Cloudflare acknowledged a systemic problem rather than isolated issues, DNS changes accelerated. 

The rate continued increasing through early afternoon UTC. There was no single coordinated moment; instead, changes happened as organizations reached their own tolerance thresholds. The acceleration pattern suggests organizations made independent decisions as the outage persisted without resolution. 

By the time Cloudflare deployed its fix at 14:30 (UTC), a percentage of monitored services had already executed DNS failover away from Cloudflare. The outage lasted long enough for these organizations to choose operational availability without Cloudflare over continued disruption waiting for a fix. 

How Long They Remained on Origin Infrastructure 

Bar graph showing distribution of time organizations remained on origin infrastructure post Cloudflare fix deployment.
Figure 8. Distribution of time organizations remained on origin infrastructure post Cloudflare fix deployment 

Organizations returned to Cloudflare on widely varying schedules. Some reverted DNS changes within hours of Cloudflare deploying its fix at 14:30 (UTC)—suggesting either automated failback processes or administrators actively monitoring the situation and acting quickly once stability returned. 

Others remained on alternative infrastructure for 24 hours or longer. Extended durations suggest organizations wanted verification periods before trusting Cloudflare again. A service being available doesn't mean it's stable. Organizations may have been waiting to see if the fix held, while monitoring Cloudflare's infrastructure for any signs of recurring issues. 

These extended DNS changes indicate some organizations treated the failover as a significant operational state change requiring careful validation and coordination before reverting, rather than an immediate restoration once Cloudflare reported resolution. Organizations with similar technical capabilities made very different risk decisions about when to return.  

Edge Caching vs. Origin-based Delivery

When organizations executed DNS failover away from Cloudflare during the outage, they moved from edge-cached delivery to serving directly from their own infrastructure. Understanding this architectural difference helps explain why performance changed even when services remained operational. 

Cloudflare CDN: Caching at the Edge 

Cloudflare's CDN operates at the application layer with cached content served from geographically distributed edge locations. 

Flowchart showing Cloudflare CDN edge caching architecture with cached content served from geographically distributed edge locations
Figure 9. Cloudflare CDN edge caching architecture with cached content served from geographically distributed edge locations 

When a request arrives at a Cloudflare edge location, Cloudflare checks whether it already has that content cached. For static assets—images, CSS files, JavaScript libraries, fonts—the answer is often yes. Cloudflare serves the content immediately from the edge location without any trip to origin servers. 

For content that is not cached or is dynamic, the request proceeds to origin infrastructure. But for cacheable content, the request never leaves the edge location. A user in Germany gets cached content from Frankfurt. A user in Singapore gets cached content from Singapore. No backend processing, no database queries, no transmission from distant origin servers. 

Cloudflare can handle TLS termination at the edge, transform content (compression, format optimization), and apply protocol optimizations—depending on customer configuration. 
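This cache behavior is visible from the client side: Cloudflare-proxied responses typically carry a cf-cache-status header (values such as HIT, MISS, or DYNAMIC) indicating whether the object was served from an edge cache or required a trip toward origin. A minimal standard-library sketch:

```python
# Client-side view of edge caching via the cf-cache-status response header.

import urllib.error
import urllib.request

def cache_status(url: str) -> str:
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            headers = response.headers
    except urllib.error.HTTPError as err:
        headers = err.headers          # the header is present even on 4xx/5xx from the edge
    return headers.get("cf-cache-status", "absent (likely not proxied by Cloudflare)")

# Static assets commonly report HIT; dynamic pages often report DYNAMIC,
# meaning the request was forwarded toward origin.
print(cache_status("https://www.cloudflare.com/favicon.ico"))
```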

Origin-based Delivery: Serving From Infrastructure Locations 

When organizations executed DNS failover, traffic began routing to their origin infrastructure—wherever that infrastructure was hosted. Some organizations operate their own data centers, while others host on cloud platforms like AWS, Azure, or Google Cloud. 

Organizations may deploy infrastructure across multiple geographic regions—commonly 2-5 regions distributed globally to provide redundancy and serve different markets. However, even multi-region deployments involve a limited number of infrastructure locations compared to edge networks. Where an edge network like Cloudflare operates 300+ locations globally, a multi-region deployment might span US East Coast, US West Coast, Western Europe, and Asia-Pacific regions. 

Flowchart depicting origin-based multi-region architecture showing traffic routing through load balancer to infrastructure locations in regions globally.
Figure 10. Origin-based multi-region architecture showing traffic routing through load balancer to infrastructure locations in regions globally 

Load balancing optimizes how traffic reaches these infrastructure locations—selecting healthy endpoints, distributing load, and providing failover between regions. But a user in Tokyo accessing infrastructure in Western Europe still experiences that geographic distance as latency, even with optimized load balancing. 
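The cost of that distance has a hard physical floor: signals in fiber propagate at roughly two-thirds the speed of light, so round-trip propagation alone sets a minimum latency between a user and a distant region. A back-of-the-envelope sketch, using approximate coordinates and ignoring real-world routing detours:

```python
# Back-of-the-envelope latency floor from geography: light in fiber propagates
# at roughly 2/3 c, so round-trip propagation alone bounds latency from below.
# Coordinates are approximate and real routes are longer than great circles.

import math

KM_PER_MS_IN_FIBER = 200.0   # ~2/3 the speed of light, in km per millisecond
EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two coordinates, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def min_rtt_ms(distance_km: float) -> float:
    return 2 * distance_km / KM_PER_MS_IN_FIBER

# A user in Tokyo reaching a hypothetical Western Europe region (Frankfurt-ish):
distance = great_circle_km(35.68, 139.69, 50.11, 8.68)
print(f"~{distance:.0f} km -> at least ~{min_rtt_ms(distance):.0f} ms RTT before any processing")
```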

Geographic Impact: What We Observed 

When organizations executed DNS failover to their origin infrastructure, we observed latency increases, with the impact varying by geography. X's DNS change provides a clear example of these patterns. 

ThousandEyes screenshot showing increased latency observed for X.com following DNS failover away from Cloudflare ASN
Figure 11. Increased latency observed for X.com following DNS failover away from Cloudflare ASN 

The availability chart (Figure 11) shows X experiencing failures on Cloudflare as availability drops around 12:00 (UTC), then executing DNS failover to origin infrastructure around 13:40 (UTC), after which availability remained stable. When X returned to Cloudflare the following day, a brief availability disruption occurred during the DNS transition before normal operations resumed. 

The latency chart shows the corresponding network performance impact during this period, with a clear spike while X served from its origin infrastructure. 

Test Location      Pre-outage (Cloudflare)   During Switch (Origin)   Post-return (Cloudflare)   Change During Switch
Seattle, WA        4.1ms                     8.3ms                    4.0ms                      +102% (2x)
Dublin, Ireland    2.4ms                     114ms                    1.5ms                      +4,650% (48x)
Tokyo, Japan       1.2ms                     112ms                    1.5ms                      +9,233% (93x)

Figure 12. Network latency across different monitoring locations during three phases 

The pattern is clear: agents outside the United States experienced dramatically larger latency increases than the U.S.-based agent. 

Why geography mattered: 

With Cloudflare's edge network, requests from Tokyo connected to Cloudflare's Tokyo edge locations, Dublin connected to Dublin edge locations, and Seattle connected to Seattle edge locations. Each agent benefited from proximity to edge infrastructure, resulting in single-digit millisecond latencies. 

When X executed DNS failover, that proximity advantage disappeared. X operates infrastructure across multiple regions, but even multi-region deployments do not match the geographic distribution of edge networks with 300+ locations. 

Tokyo's latency jumped from 1.2ms to 112ms—a 93-fold increase. Dublin saw a similar pattern: 2.4ms to 114ms, a 48-fold increase. These increases reflect the physical distance between test agents and X's infrastructure locations. 

Seattle, presumably closer to X's infrastructure, saw latency roughly double from 4.1ms to 8.3ms. Still an increase, but nowhere near the international impact. 

When X returned to Cloudflare, latencies returned to baseline levels across all locations. The edge proximity advantage was restored. 

Performance impact and architectural trade-offs: 

X's infrastructure is substantial: it operates a globally distributed network with a presence in multiple regions, and load balancing optimizes traffic distribution across those locations. But even with significant infrastructure investment, the geographic distribution differs fundamentally from an edge caching network. 

Based on availability metrics, X's service appeared reachable after the DNS failover, with users experiencing different performance levels depending on their location relative to X's infrastructure.

What NetOps Teams Can Learn

This incident reveals several patterns with direct implications for enterprise network operations: 

  • Determining where failures occur: When services fail, distinguishing between your own infrastructure issues, local network problems, or vendor infrastructure failures determines who can fix the issue and what mitigation options exist. Path-level visibility—observing where along the network path requests stop progressing—provides this critical diagnostic information. During this incident, organizations with path visibility could quickly identify failures at Cloudflare's proxy layer rather than spend time investigating their own infrastructure. 

  • Deciding when to act vs. when to wait: DNS failover restores availability but means losing vendor services such as security features, performance optimization, and traffic management. Waiting preserves those services but means continued disruption for an unknown duration. The timing observed during this incident revealed organizations' decision frameworks: most waited for vendor confirmation before acting, suggesting tolerance thresholds of roughly half an hour. Organizations benefit from defining these criteria in advance, as sketched after this list. At what point does confirmed vendor failure justify accepting degraded performance to restore availability?

  • Understanding architectural performance trade-offs: Infrastructure decisions involve performance trade-offs that become visible during outages. When organizations move from edge-cached delivery to origin-based serving, performance characteristics change in ways that vary by geography. Edge networks with hundreds of locations globally versus multi-region deployments with 2-5 locations create fundamentally different performance characteristics. The latency increases observed during this incident—93-fold for Tokyo, 48-fold for Dublin, twofold for Seattle—reflect geographic distance from origin infrastructure. Organizations should test latency from key user locations directly to origin infrastructure before incidents force the decision. The question is not whether origin-based delivery works, but whether the performance trade-off is acceptable for business continuity.

  • Deciding when to return after vendor recovery: The variance observed in return timing—from hours to multiple days—reflects different approaches to post-incident validation. Quick returns demonstrate confidence in vendor fixes and minimize time without vendor capabilities. Extended validation periods indicate organizations requiring proof of sustained stability rather than risking a second transition if issues recur. The return decision requires different criteria than the initial failover: How long should services run without issues before trusting the fix? What constitutes sufficient stability validation?  
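One way to make these criteria concrete is to encode the failover and failback gates before an incident. The sketch below is illustrative only; the thresholds and inputs are placeholders each organization would define for itself based on the considerations above.

```python
# Illustrative decision gates for failover and failback, built from the
# considerations above. All thresholds and inputs are placeholders; this is
# a sketch, not a prescription.

from dataclasses import dataclass

@dataclass
class OutageState:
    vendor_confirmed: bool        # vendor status page acknowledges a systemic issue
    minutes_failing: int          # how long monitored availability has been degraded
    origin_validated: bool        # direct-to-origin probes succeed
    latency_hit_acceptable: bool  # pre-tested origin latency from key user regions is tolerable

def should_fail_over(state: OutageState, tolerance_minutes: int = 30) -> bool:
    """Fail over only when the vendor has confirmed the issue, the tolerance
    window has elapsed, and the origin path was validated ahead of time."""
    return (
        state.vendor_confirmed
        and state.minutes_failing >= tolerance_minutes
        and state.origin_validated
        and state.latency_hit_acceptable
    )

def should_fail_back(minutes_stable_since_fix: int, validation_window: int = 120) -> bool:
    """Return to the vendor only after its fix has held for a defined window."""
    return minutes_stable_since_fix >= validation_window

print(should_fail_over(OutageState(True, 35, True, True)))    # True: criteria met
print(should_fail_back(minutes_stable_since_fix=60))          # False: keep validating
```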

Previous Updates 

[November 18, 2025, 6:30 AM PST] 

On November 18, 2025, at approximately 11:30 (UTC), Cisco ThousandEyes began observing a global outage affecting cloud and CDN provider Cloudflare, impacting multiple Internet services including X, OpenAI, and Anthropic. While network paths to Cloudflare's front-end infrastructure appeared clear of any elevated latency or packet loss, Cisco ThousandEyes observed a number of timeouts and HTTP 5XX server errors, which is indicative of a backend services issue. While Cloudflare has confirmed they are implementing remediation, the outage is still ongoing. 


Explore the outage within the ThousandEyes platform (no login required). 
