This is The Internet Report, where we analyze outages and trends across the Internet through the lens of Cisco ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
The first week of December 2025 told a story about modern infrastructure that IT operations (ITOps) teams should pay attention to. On Sunday, AWS and Google Cloud announced a collaboration designed to enable smooth failover between cloud providers. On Tuesday, AWS unveiled its new AI-powered DevOps Agent, promising to help engineers diagnose and recover from outages. Both announcements underscored the current industry focus on resilience and deep visibility into service delivery chains. And that same week, two outages at Cloudflare and Venmo exemplified why that resilience is so challenging to achieve as dependency architectures become more complex.
Venmo experienced a multi-hour outage during peak evening transaction hours that prevented users from accessing their money. Two days later, Cloudflare experienced a 25-minute global outage that cascaded through hundreds of applications dependent on Cloudflare's infrastructure.
Both incidents shared a common characteristic: application-layer failures atop completely healthy network infrastructure, with fault domains that could only be identified through the precise separation of network and application telemetry.
Against the backdrop of these outages and the announcements from AWS and Google Cloud, it may feel like the Internet is becoming less reliable. But the Internet's core infrastructure remains resilient—what's changed is that when failures do occur, they now have greater consequences.
Read on (and tune into the podcast) to understand how the Cloudflare and Venmo incidents illustrate how changes in the dependency architecture are amplifying the impact radius of individual failures—even as the underlying Internet infrastructure maintains its historical resilience. (Use the links below to jump to the sections that most interest you.)
Cloudflare Outage
On December 5, Cloudflare experienced a global outage lasting approximately 25 minutes, impacting services worldwide. Cloudflare identified the root cause as a configuration change designed to address a new security vulnerability in React Server Components (CVE-2025-55182). The update increased buffer sizes to help ensure customer protection, but during the gradual rollout, engineers noticed errors appearing in an internal testing tool. Rather than potentially delay the security protection, Cloudflare decided to disable the testing tool via its global configuration system—a system designed to propagate changes across the entire network within seconds.
That configuration change exposed a code path that had existed in production for years but had never been triggered, resulting in HTTP 500 Internal Server Errors for all requests on affected infrastructure. Availability instantly dropped from 100% to zero worldwide at 8:47 AM (UTC), then snapped back to normal at 9:12 AM (UTC). Unlike gradual or regional outages, this incident was uniform and immediate, indicating a code-level failure triggered by global configuration propagation rather than a gradual rollout.
What’s important is not just the scale, but the diagnostic clarity: ThousandEyes observed that HTTP 500 errors were returned exceptionally quickly, indicating the failure was occurring at Cloudflare's edge infrastructure before requests could reach backend services. Critically, no adverse network conditions—no packet loss, latency spikes, or routing issues—coincided with the outage. The combination of server-side errors (HTTP 500) and healthy network conditions allowed analysts to rapidly isolate the fault domain to Cloudflare's proxy layer rather than network infrastructure.
Cloudflare had experienced another outage a few weeks earlier, on November 18. However, that earlier incident differed in both technical details and failure pattern. The November 18 outage showed a patchwork of regional impacts, intermittent errors, and partial failures and successes as configuration changes rolled out unevenly across the network. In contrast, the December 5 outage was sharply binary. The monitoring data tells a precise story: a single global configuration change propagated instantly, leading to immediate and total failure, then an equally sudden recovery.
Despite their differences, both incidents involved configuration changes, though with different failure modes. November 18 involved a configuration file that exceeded hard-coded size limits, while December 5 exposed a latent code bug that had gone undetected for years. The root causes differed, but the pattern of how the issues spread was similar: a configuration change exposed underlying issues that then cascaded globally.
This is why layered visibility matters. With only application monitoring, you'd see widespread failures but couldn't definitively isolate whether the issue was in the target infrastructure or somewhere in the network path. With only network monitoring, you'd see healthy paths but have no visibility into the application failures. Together, they provide fault domain isolation that enables rapid, accurate response.
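As a rough sketch of that isolation logic, here's a minimal Python example that combines an application-layer result with network-layer metrics to suggest a fault domain. The data structures, field names, and thresholds are hypothetical and purely illustrative; they aren't tied to ThousandEyes or any specific monitoring product.

```python
from dataclasses import dataclass

@dataclass
class AppResult:
    http_status: int          # e.g., 200, 500, 503; 0 means no HTTP response at all
    response_time_ms: float   # time to complete the HTTP transaction

@dataclass
class NetResult:
    packet_loss_pct: float    # end-to-end loss on the path to the target
    latency_ms: float         # round-trip latency
    path_changed: bool        # did the route to the target change?

def isolate_fault_domain(app: AppResult, net: NetResult) -> str:
    """Suggest a fault domain by comparing application and network telemetry.
    Thresholds are illustrative, not product defaults."""
    network_healthy = (
        net.packet_loss_pct < 1.0
        and net.latency_ms < 150
        and not net.path_changed
    )
    if app.http_status >= 500 and network_healthy:
        # The server answered with an error over a clean path:
        # the problem sits in the target's application/proxy layer.
        return "application layer (target infrastructure)"
    if app.http_status >= 500 and not network_healthy:
        return "ambiguous: degraded network path and server errors"
    if app.http_status == 0 and not network_healthy:
        # No HTTP response at all, plus loss/latency pointing at the path.
        return "network path"
    return "no fault detected"

# Example: fast HTTP 500s over a clean path, the pattern seen in the Cloudflare incident
print(isolate_fault_domain(
    AppResult(http_status=500, response_time_ms=40),
    NetResult(packet_loss_pct=0.0, latency_ms=35, path_changed=False),
))
```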
The 17-day gap between incidents also highlights another industry-wide challenge that makes end-to-end visibility doubly important: the gap between recognizing necessary improvements and executing them. After the November outage, Cloudflare announced comprehensive prevention plans—gradual rollouts with health validation, fail-open error handling, and improved break-glass procedures. Cloudflare stated in its post-incident review that these prevention measures were not yet deployed at the time of the December 5 incident. This execution gap is something all NetOps teams must deal with: updates take time, and it's entirely possible that another outage will occur before you've finished making the necessary changes. That's why it's vital to have deep visibility across your entire service delivery chain, so you can quickly catch issues and mitigate them efficiently to minimize the impact on users.
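To make the "gradual rollout with health validation" idea concrete, here's a simplified Python sketch of a staged-rollout gate. This is a general deployment pattern, not Cloudflare's actual mechanism; the function names, stages, and thresholds are hypothetical.

```python
import random

STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet receiving the change
MAX_ERROR_RATE = 0.02              # abort threshold; illustrative only

def apply_to_fraction(config_version: str, fraction: float) -> None:
    """Hypothetical: push config_version to this fraction of the fleet."""
    print(f"applied {config_version} to {fraction:.0%} of nodes")

def observe_error_rate() -> float:
    """Hypothetical: return the 5xx error rate observed on nodes running the
    new config. Faked here with a random value so the demo sometimes aborts."""
    return random.uniform(0.0, 0.05)

def rollback(config_version: str) -> None:
    print(f"rolling back {config_version}")

def staged_rollout(config_version: str) -> bool:
    """Expand the change stage by stage, halting at the first sign of trouble,
    so a bad change is caught while its blast radius is still small."""
    for fraction in STAGES:
        apply_to_fraction(config_version, fraction)
        if observe_error_rate() > MAX_ERROR_RATE:
            rollback(config_version)
            return False
    return True

if __name__ == "__main__":
    staged_rollout("buffer-size-increase-v2")
```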
Venmo Outage
Two days before the Cloudflare outage, Venmo experienced its own infrastructure disruption. Starting around 6:30 PM (EST) on December 3 (11:30 PM (UTC)), users began reporting widespread inability to send or receive payments. The incident lasted several hours before Venmo announced that services were working again.
The error codes seen in the Cloudflare and Venmo outages tell different diagnostic stories. Cloudflare returned HTTP 500 errors (Internal Server Error)—indicating something broke in code or logic. Venmo returned HTTP 503 errors (Service Unavailable)—suggesting the service was temporarily unable to handle requests, which typically points to capacity issues, rate limiting, or graceful degradation rather than code crashes. Different failure modes require different remediation approaches.
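As a quick triage reference, here's a short Python sketch mapping common 5xx codes to the failure mode they usually signal and a reasonable first line of investigation. The mapping reflects general HTTP semantics; the remediation notes are illustrative, not prescriptive, and any given service may use these codes differently.

```python
# Map server-side status codes to the failure mode they usually signal
# and a sensible first investigative step.
TRIAGE = {
    500: ("Internal Server Error: code or logic failure in the service itself",
          "check recent deploys/config changes and application error logs"),
    502: ("Bad Gateway: an upstream dependency returned an invalid response",
          "check the health of upstream/origin services"),
    503: ("Service Unavailable: capacity exhaustion, rate limiting, or load shedding",
          "check saturation metrics, autoscaling, and any Retry-After headers"),
    504: ("Gateway Timeout: an upstream did not answer in time",
          "check upstream latency and timeouts along the dependency chain"),
}

def triage(status: int) -> str:
    mode, first_step = TRIAGE.get(
        status, ("Unmapped status", "inspect the response body and logs"))
    return f"HTTP {status}: {mode}. First step: {first_step}."

print(triage(500))  # Cloudflare-style failure: the edge proxy itself broke
print(triage(503))  # Venmo-style failure: the service could not take the request
```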
But the more important story isn't the specific technical failure. It's what Venmo represents: a payment service that sits atop multiple infrastructure layers, each of which is a potential point of failure. Modern payment services are layered on cloud compute platforms, database services, authentication providers, and network infrastructure. Users interact with the payment app, but the root cause of unavailability might be several layers deep in the dependency stack.
In October 2025, Venmo was unavailable during an AWS US-EAST-1 outage. Users experiencing issues with Venmo likely assumed the problem originated with Venmo itself, when the root cause was a failure several layers deeper in the dependency stack. This ripple effect—where a cloud provider outage cascades through dependent services and ultimately impacts end users—exemplifies modern infrastructure interdependency.
This isn't unique to payment apps. It's fundamental to modern Internet architecture: content delivery layered on cloud infrastructure, authentication federated across identity providers, mobile apps depending on API gateways depending on microservices depending on databases. Each layer adds convenience and capability. However, each layer also adds potential failure modes. The Internet itself remains resilient by design, but service layers amplify the impact of failures.
The Challenge of Layered Dependencies
To understand why these incidents matter more than similar failures might have a decade ago, we need to examine how Internet architecture has fundamentally changed.
The Internet's core architecture—its routing protocols, distributed DNS, and packet-switched networks—remains fundamentally resilient. The protocols that route traffic around failures still work. The distributed nature of Internet infrastructure continues to provide redundancy. What's changed isn't the Internet's resilience. It's how services are now built on top of it.
Modern architecture has shifted from distributed, independent services to layered dependencies on shared infrastructure. Applications no longer connect directly to users—they go through CDNs, API gateways, and cloud platforms. Authentication no longer lives in each application—it's federated through OAuth providers. Payment processing doesn't happen within apps—it's abstracted behind dedicated services. This layering adds efficiency, security, and features. But it also amplifies the impact radius when foundational layers fail.
When foundational services fail, every dependent service fails simultaneously—not because of issues in those services themselves, but because a shared dependency is unavailable. For example, when a CDN experiences an outage affecting a significant portion of its customer base, it's not just the CDN that becomes unavailable; so do the hundreds or thousands of downstream applications and services that depend on that layer. Not because those applications failed, but because they share a common dependency that's architecturally difficult to route around. The disruption cascades through every dependent service.
And here's the critical difference from decades past: There's frequently no fallback channel. The Internet is now the primary and often sole operational fabric for business and daily life. No paper backup systems. No manual alternative processes. No phone-based fallback. When digital infrastructure fails, operations may stop completely. This reality—combined with layered dependency architecture—makes individual incidents feel more consequential and causes outages to feel more frequent in general, even when the underlying Internet infrastructure maintains its historical reliability.
The impact radius math isn't linear—it's multiplicative through dependency chains. A single CDN outage affects hundreds of downstream applications. A payment processor failure impacts thousands of merchants unable to transact. A cloud region disruption takes major swaths of Internet services offline. An authentication provider issue locks entire ecosystems out. Impact propagates through services layered upon services layered upon infrastructure.
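The multiplication is easier to see with a toy dependency graph. The services and edges below are invented purely for illustration; real dependency graphs are far larger and messier.

```python
from collections import deque

# Hypothetical graph: each key maps a provider to the services that depend on it.
DEPENDENTS = {
    "cdn-provider":      ["retail-site", "news-site", "saas-dashboard"],
    "auth-provider":     ["retail-site", "saas-dashboard", "mobile-banking"],
    "payment-processor": ["retail-site", "mobile-banking", "p2p-payments"],
    "cloud-region":      ["cdn-provider", "auth-provider", "payment-processor"],
}

def impact_radius(failed_service: str) -> set[str]:
    """Breadth-first walk over the dependency graph: everything reachable
    from the failed service is potentially impacted."""
    impacted, queue = set(), deque([failed_service])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

# In this toy graph, one CDN failure impacts 3 downstream services,
# while one cloud-region failure impacts 8.
print(sorted(impact_radius("cdn-provider")))
print(sorted(impact_radius("cloud-region")))
```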
Key Takeaways for ITOps Teams
Here are three core takeaways for ITOps teams based on these incidents:
- The Internet's core infrastructure remains resilient by design. Routing protocols, DNS, packet switching—these continue working as designed. What's changed isn't infrastructure reliability but how we build on top of it. Modern service architectures layer dependencies in ways that concentrate risk. When foundational services experience issues, impact cascades through every dependent layer.
- Multi-layer visibility matters. Network metrics can appear healthy while application layers fail. This separation enables rapid fault isolation—understanding where problems aren't is as valuable as knowing where they are. Single-layer monitoring creates diagnostic ambiguity.
- Preparation is key to enabling resilience. Failover requires tested procedures, not just configured alternatives. The gap between theoretical capability and practical readiness shows up when seconds matter.
By the Numbers
Let's close by taking our usual look at some of the global trends that ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (December 1-14).
Global Outages
- From December 1-7, ThousandEyes observed 205 global outages, representing a 77% increase from 116 the prior week (November 24-30).
- During the week of December 8-14, global outages increased 78%, rising to 364.
U.S. Outages
- The United States saw outages increase to 84 during the week of December 1-7, representing a 65% increase from the previous week's 51.
- During the week of December 8-14, U.S. outages increased 124%, rising to 188.
- Over the two-week period from December 1-14, the United States accounted for 48% of all observed network outages.