OUTAGE ANALYSIS
AWS Outage: October 20, 2025

The Internet Report

Explaining the AWS Outage & Other Recent Incidents

By Mike Hicks
44 min read

Summary

During complex infrastructure outages, an initial issue can often have a ripple effect, creating a web of problems to resolve. Explore how this phenomenon appeared to play out during the October 20 AWS outage and also unpack other recent outage events.


This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.


Internet Outages & Trends

Cloud infrastructure outages rarely have simple, isolated causes. While there may be an initial root cause, the outage’s impact can cascade through interconnected services, causing other disruptions and creating a complex web of problems to resolve.

Diagnosing these incidents requires observing multiple infrastructure layers simultaneously, as network and application layers often tell different parts of the story—each revealing different aspects of what's occurring at different times. These temporal patterns across layers provide crucial clues for understanding the sequence of events and identifying root causes.

This complexity was evident across several major incidents in recent weeks, from an Amazon Web Services (AWS) outage that affected services like Slack, Atlassian, Snapchat, and others, to disruptions impacting Microsoft Azure, AT&T, Vodafone, and YouTube.

Read on to learn more about each of these incidents.


AWS Outage

On October 20, Amazon Web Services (AWS) experienced a significant disruption in its US-EAST-1 region that lasted over 15 hours and impacted several services that rely on AWS, including Slack, Atlassian, Snapchat, and others. What started as a DNS issue evolved into a complex cascade of infrastructure failures across multiple AWS services, demonstrating how a single technical defect can trigger a chain reaction through interconnected cloud systems.

ThousandEyes observed packet loss at AWS edge nodes in Ashburn, Virginia—the disruption's first observable symptom. This loss occurred at the last hop before AWS infrastructure, rather than on customer networks or intermediate providers, suggesting that the problem originated within AWS's network boundary.


Explore sample impact of the outage within the ThousandEyes platform (no login required)

The Root Cause: A DNS Race Condition

AWS subsequently disclosed that the root technical cause was a latent race condition in Amazon DynamoDB's automated DNS management system. DynamoDB maintains hundreds of thousands of DNS records to operate its massive fleet of load balancers, and uses an automated process to constantly update these records to handle capacity changes and failures.

The race condition involved an unlikely interaction between two independent DNS management components. One component applying an older DNS plan was unusually delayed, and while it worked through updates, another component applied a newer plan and then triggered a cleanup process that deleted the older plan. Due to the timing, the delayed component overwrote the newer plan with its older plan at the exact moment the cleanup process deleted it. This left the regional DynamoDB endpoint (dynamodb.us-east-1.amazonaws.com) with an incorrect empty DNS record—and the system in an inconsistent state that prevented automated recovery.
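
To make the timing problem concrete, here's a minimal Python sketch of the failure mode AWS described. The component names, data structures, and ordering are a hypothetical simplification for illustration, not AWS's actual implementation.

```python
# Hypothetical, heavily simplified model of the race AWS described -- not AWS's actual code.
# Two independent "DNS enactors" apply plans to a shared record; a cleanup step deletes old plans.

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

dns_record = {}   # what resolvers ultimately see for the regional endpoint
plan_store = {    # plan generation -> load balancer addresses in that plan
    1: ["10.0.0.1", "10.0.0.2"],   # older plan
    2: ["10.0.1.1", "10.0.1.2"],   # newer plan
}

def apply_plan(generation):
    """An enactor writes its plan's addresses into the live DNS record."""
    dns_record[ENDPOINT] = list(plan_store[generation])
    dns_record["generation"] = generation

def cleanup(believed_active):
    """Delete plans older than the one the cleaner believes is live."""
    for gen in [g for g in plan_store if g < believed_active]:
        del plan_store[gen]
        if dns_record.get("generation") == gen:
            # The plan being deleted is the one actually live: its addresses vanish,
            # leaving the endpoint with an empty record and no plan to roll back to.
            dns_record[ENDPOINT] = []

# Intended order: apply plan 1, apply plan 2, then clean up plan 1.
# Order during the incident (simplified): the delayed enactor applies the older plan
# *after* the newer one, then cleanup (keyed to plan 2) deletes that older-but-live plan.
apply_plan(2)
apply_plan(1)                  # delayed enactor overwrites the newer plan
cleanup(believed_active=2)     # cleanup deletes plan 1, emptying the live record

print(ENDPOINT, "->", dns_record[ENDPOINT])   # -> [] : nothing left to resolve
```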

At 6:49 AM (UTC) on October 20, all systems attempting to connect to DynamoDB via the public endpoint immediately began experiencing DNS failures. This included both customer traffic and internal AWS services that rely on DynamoDB. AWS engineers identified the DNS issue by 7:38 AM (UTC) and restored DNS information by 9:25 AM (UTC), allowing connections to resume as cached DNS records expired over the following 15 minutes.
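
From the outside, the defect surfaced as plain DNS resolution failures for the regional endpoint. A basic standard-library check like the one below is one way to distinguish "the name doesn't resolve" from "the name resolves but the service is unreachable"; the port and output format here are just illustrative.

```python
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def check_dns(hostname: str) -> None:
    """Report whether a hostname resolves, and to which addresses."""
    try:
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        # During the incident window, lookups for the regional endpoint failed here,
        # even though the service fleet behind it was still running.
        print(f"DNS resolution failed for {hostname}: {err}")
        return
    addresses = sorted({sockaddr[0] for *_, sockaddr in results})
    print(f"{hostname} resolves to: {', '.join(addresses)}")

check_dns(ENDPOINT)
```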

The Cascading Failures

However, restoring DNS didn't immediately restore all services. The three-hour DynamoDB outage had triggered cascading failures across systems that depend on it. EC2's Droplet Workflow Manager (DWFM)—responsible for managing the physical servers that host EC2 instances—requires DynamoDB to maintain leases on these servers. During the DynamoDB outage, DWFM couldn't complete required state checks, causing lease management to fail. Even after DynamoDB connectivity was restored, the accumulated state inconsistencies from those lost leases continued to cause problems.
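
The lease mechanism explains why fixing DNS didn't instantly fix EC2. The sketch below is a hypothetical, heavily simplified model of a lease manager whose renewals depend on an external state store: while the store is unreachable, leases silently age toward expiry, and the backlog of expired leases remains after the store recovers.

```python
import time

LEASE_TTL = 60.0   # seconds a lease stays valid without renewal (illustrative value)

class StateStoreUnavailable(Exception):
    """Raised when the backing state store (DynamoDB, in AWS's description) can't be reached."""

class LeaseManager:
    """Hypothetical manager keeping one lease per physical host, renewed via a state store."""

    def __init__(self, hosts, state_check):
        self.state_check = state_check   # callable that performs the required state check
        self.lease_expiry = {host: time.time() + LEASE_TTL for host in hosts}

    def renew_all(self):
        now = time.time()
        for host in self.lease_expiry:
            try:
                self.state_check(host)                      # needs the state store
                self.lease_expiry[host] = now + LEASE_TTL   # renewal succeeded
            except StateStoreUnavailable:
                pass                                        # lease keeps aging toward expiry

    def expired(self):
        now = time.time()
        return [h for h, exp in self.lease_expiry.items() if exp < now]

# Even after state_check starts succeeding again, every host in expired() must have its lease
# re-established before it can take new work -- recovery effort that outlives the original outage.
```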

Between 12:30 PM and 9:09 PM (UTC), network load balancers experienced health check failures, resulting in increased connection errors. New EC2 instance launches failed or experienced connectivity issues from 9:25 AM until 8:50 PM (UTC). Services like Amazon Connect, AWS Security Token Service, and Amazon Redshift experienced extended impact as the effects rippled through dependent systems.

What This Demonstrates

This incident illustrates how modern cloud infrastructure failures rarely have simple, isolated causes. While the root technical defect was the DNS race condition, the operational impact came from how that defect cascaded through interconnected services. Fixing the initial DNS problem revealed layers of downstream effects—state inconsistencies, failed health checks, launch failures—that required additional time to resolve. Like untangling a chain, addressing one problem exposed the next issue that had been masked or caused by the initial failure.

The complexity of cloud architecture means that even after identifying and fixing a root cause, infrastructure teams must work through the cascade of effects it triggered. Understanding the complete picture requires observing multiple layers simultaneously, as each layer may reveal different aspects of the failure at different times.

To learn more about what ThousandEyes saw during the AWS outage and key takeaways this incident leaves for infrastructure teams, tune into the podcast and read our full outage analysis.

Microsoft Azure Front Door Outage

On October 9, Microsoft was impacted by two separate service disruptions: an Azure Front Door outage and a Microsoft-AT&T service disruption. Though they happened on the same day, the incidents did not appear related. We’ll cover both in this blog post, starting with the Azure Front Door outage.

At approximately 7:50 AM (UTC) on October 9, Microsoft Azure Front Door (AFD) experienced a platform-level failure that impacted users’ ability to access Microsoft 365, Azure Portal, Entra Admin Portal, and other services that depend on AFD for content delivery and routing. The incident primarily affected customers in Europe, the Middle East, and Africa, with North American users seeing little impact. ThousandEyes data indicated that the main disruption lasted about five hours, with availability fully restored by 12:50 PM (UTC). However, some customers continued to experience elevated latency until approximately 4:00 PM (UTC) when Microsoft declared the incident fully mitigated.

Microsoft reported that the incident stemmed from software defects in AFD's infrastructure that caused edge sites in Europe and Africa to crash. The remaining sites became overloaded with redistributed traffic, causing delays and timeouts for users.


Explore the outage within the ThousandEyes platform (no login required).

This incident demonstrates how software defects can have unexpected downstream impacts. While IT operations (ITOps) teams do their best to deeply understand their full service delivery chain and fix any potential issues, sometimes bugs can lie dormant until activated by just the right conditions. These types of edge-case bugs may go unnoticed, flying under the radar of the configuration validation testing that teams typically do before pushing a configuration change into production. When this happens, ITOps teams need to be able to quickly pinpoint the problem and mitigate the issue.

What Is Azure Front Door?

To understand what happened during this outage, it’s important to first understand Azure Front Door (AFD)—and why its configuration settings matter.

AFD is Microsoft's global, cloud-native content delivery network (CDN) and application delivery service. It operates as the "front door" for Microsoft's services and customer applications, handling three critical functions:

  • First, it acts as a traffic controller, routing user requests to the closest and best-performing server. This prevents any single server from getting overloaded and helps ensure users reach one that's working properly.

  • Second, it stores content in locations closer to users, so things like webpages and files don't have to travel as far.

  • Third, it handles security connections at the edge, managing the secure handshakes closer to users rather than at Microsoft's main data centers.

When AFD fails, users may be unable to reach the services behind it—in this case, Azure Portal, Microsoft 365, and other critical services across Microsoft's ecosystem.

AFD operates through regional clusters, with each major region running its own infrastructure. These clusters include a control plane that manages configurations and routing decisions, and a data plane that processes actual traffic. This regional architecture becomes important when understanding why this outage affected some regions but not others.

What Is an AFD Tenant Profile Setting?

Each tenant (customer organization) using Azure Front Door has a configuration profile that defines how their traffic is handled. These profiles include domain configurations, routing rules, security settings, and other parameters that control how AFD processes requests.
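
To picture what such a profile conceptually contains, the illustration below sketches one as a plain Python structure. The field names and layout are invented for clarity and are not Microsoft's actual AFD schema; the point is that the platform also attaches its own metadata to these profiles, and downstream systems consume it.

```python
# Illustrative only -- invented field names, not Azure Front Door's real configuration schema.
tenant_profile = {
    "tenant_id": "contoso-prod",
    "domains": ["www.contoso.example", "api.contoso.example"],
    "routes": [
        {"match": "/api/*", "origin": "api-backend", "protocol": "https"},
        {"match": "/*",     "origin": "web-backend", "protocol": "https"},
    ],
    "security": {"min_tls_version": "1.2", "waf_policy": "default"},
    # Platform-generated bookkeeping: per Microsoft, metadata of this general kind is what
    # a control-plane defect corrupted during a specific sequence of customer operations.
    "metadata": {"version": 4182, "state": "active"},
}
```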

Microsoft reported that a specific sequence of operations created faulty metadata in these profiles, which ultimately triggered the outage when Microsoft attempted to clean it up.

What Happened During the Azure Front Door Outage?

The Azure Front Door outage involved two connected software bugs. According to Microsoft, the incident began when a software defect in a new version of AFD's control plane created faulty metadata during a specific sequence of customer operations. Microsoft's automated systems detected the problem and blocked the bad metadata from spreading—preventing immediate customer impact.

After disabling the new control plane, Microsoft began cleaning up the faulty metadata. However, this cleanup operation triggered a different, latent bug in the data plane. This second bug caused infrastructure components to crash at edge sites across Europe and Africa. As AFD automatically redistributed traffic to the remaining healthy edge sites, these sites became overloaded, causing delays and timeouts for users trying to access Microsoft 365, the Azure Portal, and other affected services.
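
The sequence Microsoft described follows a familiar pattern: the data plane implicitly assumed every profile it loaded would carry well-formed metadata, and the cleanup produced input that broke that assumption. The sketch below is a hypothetical illustration of that pattern, not AFD's actual code.

```python
def load_profile(profile):
    """Hypothetical data-plane loader that assumes metadata is always present and well formed."""
    meta = profile["metadata"]              # raises KeyError if cleanup removed the metadata
    return meta["version"], meta["state"]   # raises KeyError/TypeError if it was left malformed

profiles = [
    {"tenant_id": "healthy-tenant", "metadata": {"version": 7, "state": "active"}},
    {"tenant_id": "cleaned-tenant"},        # cleanup stripped the faulty metadata entirely
]

for profile in profiles:
    try:
        print(profile["tenant_id"], load_profile(profile))
    except (KeyError, TypeError) as err:
        # In a real data plane, an unhandled error like this during a config reload is how
        # one bad record translates into crashed edge-site components.
        print(profile["tenant_id"], "would crash the loader:", repr(err))
```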

During the outage, ThousandEyes observed symptoms consistent with Microsoft’s reported cause and effects. We saw connection timeouts and service-related errors, as well as packet loss within Microsoft's network infrastructure in affected regions. ThousandEyes also observed cases where AFD-fronted services were unreachable from affected geographies.

The outage also appeared to have three distinct phases:

Phase 1: Initial Degradation

During the outage’s initial phase, from 7:50 AM to about 9:20 AM (UTC), ThousandEyes observed intermittent delays and timeouts when testing connectivity to Microsoft services. Some test requests succeeded, while others failed. ThousandEyes saw a gradual increase in error rates over approximately 90 minutes as more edge sites became impacted.

Phase 2: Peak Impact

At the outage’s peak, from about 9:20 AM to 12:50 PM (UTC), ThousandEyes observed widespread failures when attempting to reach affected services from test agents in impacted regions, consistent with Microsoft's reported peak failure rates of approximately 17% in Africa and 6% in Europe. ThousandEyes also saw 100% forwarding loss at specific Microsoft network nodes in affected regions.

Figure 1. During the outage, ThousandEyes observed 100% forwarding loss at Microsoft's network nodes in affected regions

Phase 3: Recovery

Microsoft reported that automated restarts of infrastructure components began at 9:08 AM (UTC), with manual intervention for resources that didn't recover automatically. Critical services like the Azure Portal performed failover operations to route traffic away from affected endpoints. As these recovery efforts progressed, ThousandEyes observed intermittent connectivity, followed by a return to normal performance levels as capacity was gradually restored. By 12:50 PM (UTC), availability was fully restored, though some customers continued to experience elevated latency until Microsoft declared the incident fully mitigated at approximately 4:00 PM (UTC).

Why Did the Azure Front Door Outage Only Affect Certain Regions?

As mentioned, AFD operates through regional clusters. Each region runs its own control plane and data plane infrastructure.

This regional architecture may help explain why the October 9 outage primarily affected Europe and Africa, while the Americas and other regions remained largely unaffected. Without official confirmation from Microsoft, we can only speculate, but here are the most likely explanations:

First, the specific customer operations that triggered the control plane defect may have been more common in Europe and Africa. Data residency requirements like GDPR can sometimes lead to specific configuration patterns in these regions.

Alternatively, the Europe and Africa regional clusters may have been running a different version of the AFD platform code where the latent bug existed, perhaps due to staggered regional deployments.

Finally, the infrastructure failures may have simply been concentrated in Europe and Africa clusters. Once these failures occurred, they created a cascading effect as traffic was redistributed, amplifying the impact within those regions.

What Can ITOps Teams Learn From the Azure Front Door Outage?

This incident demonstrates how even remediation operations can sometimes have unexpected impacts when they awaken a dormant bug. Despite thorough testing, a defect can go undetected until it's hit with exactly the right conditions. Modern CDN and edge platforms also allow for extensive customization, which creates a vast number of potential edge-case combinations; it's impractical for ITOps teams to test every configuration permutation to catch all possible issues.

As a result, encountering latent bugs is a common challenge. ITOps teams must be able to quickly identify the source of an unexpected issue and take steps to rectify it to minimize impact on their users.

The outage is also a reminder of the potential benefits of dividing a global system into multiple regional architectures. AFD’s regional architecture appeared to prevent the outage from spreading to the United States and kept it more contained than it likely would have been otherwise. However, IT operations teams must be aware that within regions, issues can still spread. In the case of the AFD outage, the infrastructure crashes in Europe and Africa had ripple effects that impacted all customers in those regions as load was redistributed to remaining edge sites. ITOps teams must plan for such ripple effects when preparing their outage response strategies so they’re ready to effectively respond to outages that start small but quickly grow.

Microsoft-AT&T Service Disruption

At approximately 6:11 PM (UTC) on October 9, a network-related incident affected multiple Microsoft services, including Microsoft 365, Teams, Outlook, and Azure services. The service disruption impacted United States users nationwide who were attempting to access Microsoft services through AT&T's network. Other Internet service providers (ISPs) remained unaffected. Lasting a little over 30 minutes, the incident was fully resolved at 6:48 PM (UTC) when Microsoft completed traffic rerouting operations.

What appeared to cause this Microsoft-AT&T service disruption? Read on to discover what ThousandEyes observed and important takeaways for network operations (NetOps) teams.


Explore the service disruption within the ThousandEyes platform (no login required).

What Happened During the Microsoft-AT&T Service Disruption?

Microsoft reported that the incident stemmed from a misconfigured portion of network infrastructure in North America. ThousandEyes observed forwarding loss at Microsoft edge nodes, with AT&T's network successfully delivering packets to those nodes. These observations and the incident’s characteristics—simultaneous nationwide impact affecting only AT&T customers and rapid 37-minute resolution—suggested an issue affecting the connection between Microsoft and AT&T.

Microsoft's edge infrastructure typically uses separate configurations to handle traffic from different ISP partners. The observed forwarding loss at Microsoft edge nodes, combined with successful packet delivery through AT&T's network, suggested the misconfiguration affected how traffic was handled at the connection point between the two networks. Additionally, traffic through AT&T to non-Microsoft destinations appeared unaffected, further indicating the issue was specific to the Microsoft connection.
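
The underlying reasoning is essentially "find the first hop where forwarding loss begins and note who owns it and the hop before it." The sketch below applies that logic to hypothetical per-hop results of the kind path traces produce; it isn't ThousandEyes' algorithm, just the basic idea.

```python
# Hypothetical per-hop results toward a Microsoft service prefix:
# (hop index, network that owns the hop, % forwarding loss observed at that hop).
path = [
    (1, "enterprise LAN",   0.0),
    (2, "AT&T",             0.0),
    (3, "AT&T",             0.0),
    (4, "AT&T",             0.0),    # AT&T delivers packets all the way to the handoff...
    (5, "Microsoft edge", 100.0),    # ...and loss begins at the Microsoft edge node
    (6, "Microsoft",      100.0),
]

def localize_loss(hops, threshold=50.0):
    """Return (last clean hop, first lossy hop) based on a loss threshold."""
    for i, (idx, owner, loss) in enumerate(hops):
        if loss >= threshold:
            return (hops[i - 1] if i > 0 else None), (idx, owner, loss)
    return hops[-1], None

last_clean, first_lossy = localize_loss(path)
print("last clean hop:", last_clean)
print("loss begins at:", first_lossy)
# Loss starting exactly at the inter-provider handoff points at the connection between
# the two networks rather than at either provider's core.
```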

Figure 2. ThousandEyes observed significantly elevated packet loss between AT&T and Microsoft networks

Microsoft's resolution through traffic rerouting was consistent with ThousandEyes' observations of the recovery patterns and further suggested that the cause was an issue impacting the connection between Microsoft and AT&T. Also supporting this root cause, the rapid 37-minute resolution and simultaneous nationwide impact across all AT&T connection points were consistent with a configuration issue that could be quickly addressed through traffic rerouting.

What Can NetOps Teams Learn From This Service Disruption?

This Microsoft-AT&T service disruption leaves some important takeaways for network operations teams.

  • Observing Failure Points Aids Diagnosis: When investigating network issues, identifying where packets fail provides valuable diagnostic information. ThousandEyes observed that AT&T's network successfully delivered packets to Microsoft's edge nodes, where forwarding loss then occurred. This visibility helped narrow the investigation to the connection point between the two networks, rather than issues within AT&T's broader network infrastructure. For NetOps teams, having monitoring that spans across network boundaries is essential for quickly identifying which part of a complex service chain requires attention.

  • Isolating Failures Limits Impact: The misconfiguration affected only AT&T traffic while other ISP connections to Microsoft remained operational. This isolation—whether by design or circumstance—limited the scope of the disruption. This incident marked the second Microsoft-AT&T service disruption in approximately 13 months involving configuration changes, highlighting the recurring challenges of managing complex network interconnections at scale. NetOps teams should consider how failures in one part of their infrastructure might be contained to minimize broader impact.

  • Configuration Management Requires Caution: While the impact was limited to AT&T customers, the simultaneous nationwide effect indicated the configuration issue affected all connection points at once. NetOps teams should recognize that even when systems are designed to isolate failures, centralized configuration management can still cause widespread impact if a problematic change is pushed broadly. Staged rollouts, comprehensive testing, and rapid rollback capabilities remain essential—particularly for changes affecting critical network connections where issues can have immediate, widespread customer impact. (A minimal staged-rollout sketch follows this list.)
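
As a deliberately simplified illustration of the staged-rollout idea from the last takeaway, the sketch below pushes a configuration change to connection points in small waves, checks a health signal after each wave, and rolls back if the signal degrades. The site names, wave size, and health check are placeholders for real inventory, telemetry, and automation.

```python
import random

def apply_config(site: str) -> None:
    """Placeholder: push the new interconnect configuration to one connection point."""
    print(f"applied change to {site}")

def rollback(sites: list[str]) -> None:
    """Placeholder: restore the previous configuration on every site touched so far."""
    print(f"rolled back: {', '.join(sites)}")

def healthy(site: str) -> bool:
    """Placeholder health signal, e.g. forwarding loss or error rate for the site."""
    return random.random() > 0.05   # simulate an occasional post-change problem

def staged_rollout(sites: list[str], wave_size: int = 2) -> bool:
    touched: list[str] = []
    for start in range(0, len(sites), wave_size):
        wave = sites[start:start + wave_size]
        for site in wave:
            apply_config(site)
            touched.append(site)
        if not all(healthy(site) for site in wave):
            rollback(touched)   # stop before the change reaches every connection point at once
            return False
    return True

staged_rollout(["iad-edge-1", "iad-edge-2", "dfw-edge-1", "sjc-edge-1"])
```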

Vodafone Outage

On October 13, Vodafone UK experienced a major outage beginning around 1:30 PM (UTC) that affected broadband, 4G, and 5G services for approximately two hours, though some customers experienced extended disruption. The incident appeared to be confined to U.K. network infrastructure, with Vodafone identifying the cause as a software issue with one of its vendor partners.

What Happened During the Vodafone Outage?

The disruption impacted IP-based services across Vodafone's UK network, rendering Internet connectivity unavailable for both mobile and broadband customers. Vodafone's own website (vodafone.co.uk) and customer service systems also became inaccessible during the incident. However, Vodafone confirmed that 2G voice calls and SMS messaging continued functioning throughout the incident, suggesting the failure primarily impacted IP packet-switched services while circuit-switched infrastructure remained operational.

Recovery began approximately two hours after the initial incident, with services gradually restored as Vodafone implemented remediation measures.

What Did the Network Data Reveal?

ThousandEyes’ analysis of BGP routing data revealed significant control plane disruption during the incident. Both AS25135 (Vodafone UK Packet Backbone Network) and AS5378 (Vodafone UK's external-facing autonomous system) showed significant route withdrawals, with announced IP address space dropping to near-zero levels. Corresponding BGP announcement activity spiked significantly, with volumes reaching 25,000-27,000 announcements as routes were withdrawn and subsequently re-announced during recovery.

The simultaneous withdrawal of routes from both autonomous systems indicated a control plane failure affecting Vodafone's ability to advertise its address space to the global Internet. This pattern is consistent with either a BGP software failure, a routing policy error, or an infrastructure failure affecting the systems responsible for maintaining route advertisements.
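
One way to observe this kind of event from the outside is to watch public BGP feeds for prefixes originated by the affected AS and count announcements versus withdrawals over time. The sketch below uses the open-source pybgpstream library (an assumption about available tooling); the collector and time window are illustrative, and prefixes are attributed to AS5378 only from announcements seen within that window.

```python
from collections import Counter

import pybgpstream  # CAIDA's BGPStream bindings: pip install pybgpstream

stream = pybgpstream.BGPStream(
    from_time="2025-10-13 12:00:00",
    until_time="2025-10-13 16:00:00",
    collectors=["rrc00"],          # RIPE RIS collector; add more for broader visibility
    record_type="updates",
)

originated = set()       # prefixes whose announcements show AS5378 as the origin
per_minute = Counter()   # (minute bucket, "A"/"W") -> update count for those prefixes

for elem in stream:
    prefix = elem.fields.get("prefix")
    minute = int(elem.time) // 60 * 60
    if elem.type == "A":
        path = elem.fields.get("as-path", "").split()
        if path and path[-1] == "5378":              # origin AS is the last hop in the path
            originated.add(prefix)
            per_minute[(minute, "A")] += 1
    elif elem.type == "W" and prefix in originated:  # withdrawal of a previously seen prefix
        per_minute[(minute, "W")] += 1

for (minute, update_type), count in sorted(per_minute.items()):
    print(minute, update_type, count)
```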

How Does Vodafone's Network Architecture Work?

To understand the impact pattern, it's helpful to understand how Vodafone's autonomous systems (ASes) appear to be structured. AS25135 appears to function as an internal backbone network that peers primarily with AS5378, which in turn serves as the outward-facing AS handling Internet peering and transit. In typical operations, externally visible routes originate from AS5378, even though traffic flows through AS25135's backbone infrastructure.

The fact that both autonomous systems simultaneously withdrew routes suggests the failure affected a shared dependency—likely the BGP infrastructure, routing software, or control plane systems that both ASes relied upon to maintain their route advertisements.

The Circuit-Switched vs. Packet-Switched Divide

The clean demarcation between services that continued operating and those that failed provides important clues. Circuit-switched services (2G voice calls and SMS) continued functioning, while all IP-based services failed. This pattern is characteristic of a failure in the IP routing or control plane infrastructure, rather than a complete network collapse affecting all service types.

Additionally, based on connectivity observations during the incident, vodafone.com (the international site) appeared to remain reachable while vodafone.co.uk (the U.K.-specific site) became unreachable. This suggests these services may operate on separate infrastructure or DNS zones, with the .co.uk infrastructure more tightly coupled to the affected AS infrastructure.

What Seemed To Cause the Vodafone Outage?

Based on the observed failure patterns and Vodafone's statement about a vendor software issue, several scenarios could explain the incident:

  • BGP Software Failure: A software defect in BGP routing software used across both autonomous systems could have caused route withdrawals. If both AS25135 and AS5378 relied on the same routing software platform or central route servers, a software crash or malfunction would explain the simultaneous impact.

  • Routing Policy Error: A configuration change or automated policy update affecting both ASes could have inadvertently withdrawn routes. Such changes can propagate quickly across routing infrastructure, especially in systems with centralized configuration management.

  • Control Plane Infrastructure Failure: A failure in shared control plane infrastructure—such as route reflectors, centralized routing controllers, or BGP session management systems—could have disrupted route advertisements from both autonomous systems simultaneously.

The rapid recovery timeframe (approximately two hours) suggests the issue was likely software-related rather than hardware failure, as software issues can often be resolved through restarts, rollbacks, or configuration corrections more quickly than hardware replacements would allow.

What Can ITOps Teams Learn From the Vodafone Outage?

While the Vodafone incident was confined to U.K. infrastructure, it offers valuable lessons for any organization dependent on complex infrastructure—whether managed internally or by external providers.

  • Understand Your Provider's Architecture—Even When You Can't Control It: For organizations relying on ISPs, cloud providers, or other infrastructure services, understanding how your provider's network is structured can help you interpret outages and plan responses. BGP monitoring tools can provide visibility into routing changes that affect your connectivity, even when those changes occur outside your direct control. When routes disappear or major BGP activity occurs, this information helps ITOps teams understand whether issues are local to their infrastructure or whether they stem from upstream providers.

  • Multiple Service Delivery Paths Provide Resilience: The continued operation of certain Vodafone services during the IP infrastructure failure demonstrates an important principle: Maintaining multiple service delivery mechanisms—even if they use different underlying technologies—can provide critical resilience during major incidents. For enterprise ITOps teams, this translates to architectural considerations like multi-cloud strategies, diverse connectivity paths, or maintaining alternative communication channels that don't share the same infrastructure dependencies.

  • Regional Failures Can Have Global Business Impact: While this outage was technically regional (confined to U.K. infrastructure), it also had significant business impact for organizations based in other regions who had U.K. operations or customers. ITOps teams should consider geographic concentration risks in their architecture. Even if your infrastructure is globally distributed, regional failures in areas where you have concentrated customers, data centers, or critical operations can be as impactful as a global outage.

  • Third-party Issues Require First-party Preparedness: Vodafone attributed the incident to a vendor software issue—a reminder that failures in components you don't directly control can still significantly impact your services. ITOps teams should maintain visibility into their critical dependencies (whether internal vendors, cloud providers, ISPs, or SaaS platforms) and have response procedures that account for scenarios where the root cause is outside their remediation control. This includes communication plans for explaining provider-side outages to your users and customers.

  • Control Plane Failures Are Different From Performance Degradation: This incident demonstrated how control plane failures (BGP route withdrawals) create different failure patterns than data plane issues (packet loss, latency). Control plane failures often cause immediate, complete service loss rather than gradual degradation. For ITOps teams, this means that monitoring should include control plane health indicators—not just performance metrics—and incident response procedures should account for scenarios where services fail completely and suddenly rather than degrading over time.

YouTube Outage

On October 15, YouTube experienced an outage affecting video playback across YouTube, YouTube Music, and YouTube TV. The disruption began around 11:00 PM (UTC) on October 15 and lasted approximately 90 minutes until YouTube confirmed resolution at approximately 12:30 AM (UTC) on October 16. The incident affected both web and mobile platforms globally.

What Happened During the YouTube Outage?

Video playback failed across the platform despite the interface appearing to function normally. According to ThousandEyes data, videos were unable to load. In addition, users reportedly saw error messages: "An error occurred. Please try again later" on web platforms and "Something went wrong" in mobile applications. The platform remained navigable—search functionality, channel pages, and browsing features continued to work—but video playback consistently failed across all affected services.

The issue also impacted YouTube TV services, as well as YouTube Music, where streaming failed though offline downloads continued to play normally.

YouTube acknowledged the incident at approximately 12:00 AM (UTC) on October 16 and confirmed full resolution approximately 30 minutes later. At the time of writing, YouTube hasn’t released an official explanation of the root cause.

What Did ThousandEyes Observe?

Figure 3. From ThousandEyes observations, page components loaded successfully, but video playback failed to initialize, showing only a black screen

ThousandEyes synthetic transaction monitoring revealed a specific failure pattern during the outage period. The YouTube page structure and UI components loaded successfully—the video player frame rendered, navigation elements appeared, and static content displayed normally with successful HTTP 200 responses for page components including the watch page, base JavaScript files, and other static resources.

However, video playback failed to initiate, with the player displaying only a black screen where content should appear. Network waterfall analysis showed the "videoplayback" component requests being cancelled or timing out without returning any HTTP status codes. This pattern indicated that requests were being initiated but never received responses, leading browsers to eventually cancel pending requests after timeout thresholds were exceeded.
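
That distinction, a request that never gets a status code versus one that fails fast with an error code, can be tested for directly. The sketch below uses the third-party requests library to separate the two cases; the URL is a placeholder, not YouTube's actual videoplayback endpoint.

```python
import requests

def classify_endpoint(url: str, timeout_s: float = 10.0) -> str:
    """Distinguish 'server answered with an error' from 'server never answered at all'."""
    try:
        resp = requests.get(url, timeout=timeout_s)
    except requests.exceptions.Timeout:
        # The request was sent but no response arrived in time -- the pattern seen during
        # the outage: requests cancelled with no HTTP status code at all.
        return "hung: no HTTP status before timeout"
    except requests.exceptions.ConnectionError as err:
        return f"unreachable at the network layer: {err}"
    if resp.status_code >= 500:
        return f"explicit server error: HTTP {resp.status_code}"
    return f"responded: HTTP {resp.status_code}"

# Placeholder URL for illustration; point this at whichever request the waterfall shows hanging.
print(classify_endpoint("https://video-backend.example/videoplayback?id=demo"))
```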

What Possible Root Causes Can We Rule Out?

Figure 4. Network paths showed no coinciding issues during the outage, indicating an application-layer failure

Network or content delivery network (CDN) issues seem unlikely root causes. Network infrastructure failures typically manifest as HTTP error codes (503 Service Unavailable, 504 Gateway Timeout, etc.) rather than cancelled requests without status codes. Static content loaded successfully from CDNs, indicating the content delivery infrastructure remained operational. Additionally, ThousandEyes network path analysis showed 100% availability with zero packet loss along paths to YouTube servers, confirming network connectivity remained intact throughout the incident.

A network-level failure affecting global users simultaneously would require systemic infrastructure failure across multiple regions and ISPs. Network monitoring during this timeframe showed no corresponding network issues affecting YouTube's connectivity, and the clean network paths observed from monitoring vantage points further support that this was not a network or CDN infrastructure problem.

Problems with authentication services also didn’t seem to be the cause. The issue affected public video content that was accessible without login credentials. Platform features requiring authentication—including browsing, search, and account access—continued to function normally, suggesting authentication and session management systems remained operational.

What Might Have Caused the YouTube Outage?

The pattern of successful page load but failed video playback points toward a failure in the backend services responsible for playback initialization. Specifically, requests being cancelled without status codes suggests backend services were accepting connections but not responding—essentially hanging after receiving requests rather than explicitly failing or returning error codes.

This type of issue would likely affect services handling critical video playback functions:

  • Video Streaming Manifest Generation: YouTube uses adaptive bitrate streaming technologies (DASH/HLS) requiring manifest files—playlists that instruct the video player which video segments to request, in what sequence, and at what quality levels. These manifests must be dynamically generated for each playback session based on available formats, device capabilities, and network conditions. A failure in manifest generation services would prevent videos from starting to play while allowing page structures to load normally. (A simple manifest sanity check is sketched after this list.)

  • Authorized Playback URL Creation: For security and access control, YouTube generates time-limited, cryptographically signed URLs granting temporary permission to access video data chunks stored on CDN edge servers. If services responsible for creating these authorized URLs failed, video players would be unable to retrieve video segments despite underlying content and CDN infrastructure remaining operational.

  • Playback Session Coordination: Before streaming begins, backend services must coordinate multiple subsystems—validating content access, determining appropriate video formats, initializing analytics tracking, and establishing playback session state. A failure in this coordination layer would manifest as videos failing to initialize despite the interface loading correctly.
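
To make the first scenario above more concrete, the sketch below fetches a playlist URL and applies two basic sanity checks for an HLS-style manifest: it should start with #EXTM3U and should reference either media segments or variant streams. The URL is a placeholder, and this isn't YouTube's actual manifest format or endpoint; it simply shows what "manifest generation failed" looks like from the client side.

```python
import urllib.error
import urllib.request

def check_hls_manifest(url: str, timeout_s: float = 10.0) -> str:
    """Fetch a playlist and confirm it looks like a usable HLS manifest."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return f"manifest request failed fast: HTTP {err.code}"
    except OSError as err:   # covers timeouts, DNS failures, connection resets
        return f"manifest request hung or was unreachable: {err}"

    lines = [line.strip() for line in body.splitlines() if line.strip()]
    if not lines or lines[0] != "#EXTM3U":
        return "response is not an HLS playlist"
    # A playable manifest must reference media segments (#EXTINF) or variants (#EXT-X-STREAM-INF).
    has_entries = any(l.startswith(("#EXTINF", "#EXT-X-STREAM-INF")) for l in lines)
    return "manifest looks playable" if has_entries else "manifest is empty: playback cannot start"

# Placeholder URL for illustration only.
print(check_hls_manifest("https://streaming.example/playlists/demo/master.m3u8"))
```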

Was YouTube’s Recent UI Redesign a Factor?

YouTube began rolling out a major video player redesign on October 13-14, shortly before the October 15 outage. While we can’t say for sure whether the events were linked, their proximity warrants consideration. UI updates can introduce changes to API call patterns, request sequencing, or backend service dependencies that may lead to issues later, even if no problems appear immediately under normal testing conditions.

The new interface might have altered how videoplayback initialization requests are structured or introduced changes to backend service interactions. Such modifications could lead to issues that only trigger failures after certain conditions are met—whether through gradual resource accumulation, specific request patterns, cache state problems, or race conditions that emerge under production load but weren't detected during testing.

It's worth noting that backend services can experience failures regardless of UI changes—due to software defects, configuration changes, capacity issues, or dependency problems. The timing is notable but not definitive evidence of causation.

What Can ITOps Teams Learn From the YouTube Outage?

While YouTube has not disclosed the technical root cause, this incident offers valuable lessons for organizations managing complex service architectures.

  • Backend Service Failures Can Hide Behind Healthy Infrastructure: This outage demonstrated how backend application services can fail while underlying infrastructure remains operational. Network paths showed zero packet loss, CDN servers were reachable, and page structures loaded successfully—yet video playback completely failed. ITOps teams should have monitoring in place that covers not just infrastructure health but also application-layer service functionality, particularly for critical user-facing features. Synthetic transaction monitoring that tests complete user workflows, rather than just endpoint availability, can detect these service-layer failures. (A minimal workflow-check sketch follows this list.)

  • Service Architecture Isolation Limits Impact: The ThousandEyes data showed certain platform components loaded successfully (page structure, static assets, JavaScript files) while others failed (video playback initialization). This pattern suggests architectural separation between different service layers. When designing critical systems, consider which components can be decoupled to help ensure that failures in one subsystem don't cascade to unrelated features. This isolation not only limits user impact but also aids in diagnosis by clearly delineating which services are affected.

  • Rapid Changes Require Extended Observation Periods: While we can’t say for sure whether YouTube’s UI redesign was connected to the outage that followed shortly after, the close timing underscores the importance of careful post-deployment monitoring. Not all issues manifest immediately after changes are deployed. Some problems only emerge after specific conditions are met: particular usage patterns, cache states, resource accumulation, or race conditions that take time to surface. ITOps teams should implement extended monitoring periods after significant updates, recognizing that success can only be declared after sufficient observation time under production load patterns.

  • Global Services Need Resilient Failover Architecture: The simultaneous worldwide impact suggests this failure occurred in centralized services without effective regional failover. For services operating at global scale, architecture should include regional service deployments with the ability to failover traffic to healthy regions when localized service failures occur. This prevents single points of failure from creating global outages and provides resilience against service-layer issues.

  • Functional Monitoring Complements Infrastructure Monitoring: This outage illustrated a critical distinction between infrastructure health and service functionality. Traditional infrastructure monitoring showed everything working correctly—the site was reachable, pages loaded, navigation functioned normally. However, the core function that makes YouTube valuable (video playback) had completely failed. ITOps teams need monitoring that validates critical business functions, not just infrastructure availability. A service can be "up" from an infrastructure perspective while being effectively "down" from a user perspective if its primary function is broken.

  • User Experience Defines Service Availability: From an operational perspective, infrastructure metrics showed healthy systems. From a user perspective, the service was completely down. This incident reinforces that monitoring and service level agreements (SLAs) should be defined based on actual user capabilities (can users accomplish their goals?) rather than intermediate technical metrics (are servers responding to health checks?). Service availability must be measured by whether users can successfully complete critical workflows, not just whether infrastructure components are operational.
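
As a minimal illustration of the workflow-versus-endpoint distinction from the first takeaway above, the sketch below uses the open-source Playwright library (an assumption about available tooling) to load a watch page, confirm the page itself rendered, and then separately check whether the page's <video> element actually has media ready to play. The URL, wait time, and readiness threshold are placeholders.

```python
from playwright.sync_api import sync_playwright

URL = "https://video-site.example/watch?v=demo"   # placeholder monitoring target

def check_playback(url: str) -> dict:
    """Return separate pass/fail signals for page load versus video playback readiness."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url, timeout=30_000)
        page_loaded = response is not None and response.ok

        page.wait_for_timeout(5_000)   # give the player a few seconds to start buffering
        # readyState >= 3 (HAVE_FUTURE_DATA) means the player has media it can actually play.
        # In an outage like this one, page_loaded stays True while this value stays low.
        ready_state = page.evaluate(
            "() => { const v = document.querySelector('video'); return v ? v.readyState : -1; }"
        )
        browser.close()

    return {"page_loaded": page_loaded, "video_ready": ready_state >= 3}

print(check_playback(URL))
```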

By the Numbers

Let’s close by taking a look at some of the global trends that ThousandEyes observed over recent weeks (October 6 - 19) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.

Global Outages

  • From October 6-12, ThousandEyes observed 185 global outages, representing an 18% decrease from 226 the prior week (September 29 - October 5). This continued the downward trend that began at the end of September, moving further away from the elevated plateau of around 300 outages per week that had characterized most of September.

  • During the week of October 13-19, global outages experienced a significant 42% decline, dropping to 107.

United States Outages

  • The United States saw outages decline to 113 during October 6-12, representing a 14% decrease from the previous week's 132.

  • During October 13-19, U.S. outages decreased 42%, dropping to 65. This notable decline mirrored the broader global trend of significantly reduced network disruptions.

  • Over the two-week period from October 6-19, the United States accounted for 61% of all observed network outages, representing a slight increase from the 56% observed in late September, though occurring during a period of overall declining disruption levels.

Figure 5. Global and U.S. network outage trends over eight recent weeks
