ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis of the LinkedIn outage on August 5, 2024, is based on our extensive monitoring, as well as ThousandEyes' global outage detection service, Internet Insights. See how the outage unfolded in this analysis; more updates will be added as we have them.
Outage Analysis
On August 5th, Microsoft experienced an incident that impacted the availability of LinkedIn for some users around the globe. The outage was first observed around 18:25 UTC and manifested as elevated packet loss in Microsoft’s network, as well as DNS resolution timeouts and HTTP errors.
The resultant disruption to LinkedIn lasted a little over an hour. LinkedIn confirmed in a status update that users were able to reconnect to its service by approximately 19:40 UTC. ThousandEyes observed some residual network latency issues after the reported resolution; however, they did not appear to prevent users from interacting with LinkedIn services, and the issues eventually resolved around 22:30 UTC.
Explore an interactive view of the outage as seen in the ThousandEyes platform (no login required).
The incident also impacted other Microsoft services, including Microsoft Teams and Microsoft 365. Microsoft issued a statement indicating that a configuration change to Azure Front Door (AFD) resulted in a disruption to some of its own applications leveraging AFD as their content delivery network (LinkedIn is a Microsoft-owned application). No external commercial customers of Azure Front Door appeared to have been impacted.
The Outage Begins: Packet Loss and Connectivity Disruptions
As the incident began to unfold, ThousandEyes detected elevated packet loss within Microsoft’s network (see figure 3) that impacted the reachability of application servers, as well as essential services like SSL and DNS, leading to timeouts and error messages for users trying to access LinkedIn.
The connectivity and timeout issues were intermittent and impacted users unevenly. DNS issues, in particular, appeared to affect only a subset of the impacted LinkedIn users. One possible reason the DNS timeouts did not affect all users equally is that LinkedIn leverages two authoritative DNS providers, Microsoft and NS1. Users whose queries reached NS1 nameservers during the incident would not have been impacted by the connectivity issues in Microsoft's network; however, once they received a DNS response, those same users could still have been impacted by the issues affecting Azure Front Door.
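The benefit of dual authoritative providers can be illustrated with a minimal sketch. Everything below is hypothetical (the lookup functions are stubs, and the address is from a documentation range); it shows only the retry logic, not how real resolvers actually behave:

```python
# Hypothetical illustration of why dual authoritative DNS providers blunt
# (but don't eliminate) a provider-side outage: a resolver that times out
# against one provider's nameservers can retry the other provider's.
def resolve(hostname, providers):
    """Try each provider's nameservers in turn; return the first answer.

    `providers` maps a provider name to a lookup callable that returns an
    IP string or raises TimeoutError. Returns (provider, ip) on success.
    """
    failed = []
    for name, lookup in providers.items():
        try:
            return name, lookup(hostname)
        except TimeoutError:
            failed.append(name)
    raise TimeoutError(f"all providers timed out: {failed}")

# Simulated nameservers: the Microsoft-hosted ones time out during the
# incident, while the NS1-hosted ones still answer.
def microsoft_ns(host):
    raise TimeoutError("no response")

def ns1_ns(host):
    return "198.51.100.10"  # documentation-range address, not a real one

provider, ip = resolve("www.linkedin.com",
                       {"microsoft": microsoft_ns, "ns1": ns1_ns})
print(provider, ip)  # ns1 198.51.100.10
```

Even in this toy model, a successful NS1 answer only gets the user past name resolution; the connection itself could still fail at Azure Front Door, matching what was observed.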
The Outage Continues: Suboptimal Routing Adds to Reachability Issues
According to a blog post from LinkedIn, the company utilizes Azure Front Door (AFD) for its CDN infrastructure. AFD plays a critical role in ensuring users connect to LinkedIn via servers close to where they are located. For example, before the incident, a client in Los Angeles, CA, was being routed to a server within Microsoft’s network in Los Angeles (see figure 6).
As the outage continued, routing anomalies started to emerge, with users being routed to servers outside of their region. While this rerouting did not always coincide with packet loss, it did add extra distance, resulting in increased latency and more frequent connection timeouts.
Routing was also unstable during the outage. For example, this same Los Angeles location was also routed to an edge server in Iowa in the US, where traffic was dropped before reaching the LinkedIn service.
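The latency cost of an out-of-region detour can be estimated with simple back-of-envelope math. The sketch below assumes light in fiber travels at roughly two-thirds the speed of light (~200,000 km/s) and an illustrative 2,500 km one-way detour; it ignores queuing and routing overhead, so real added latency would be higher:

```python
# Lower-bound estimate of extra round-trip time when a client is routed
# to a distant edge server (e.g., Los Angeles traffic sent via Iowa).
# Assumes propagation in fiber at ~200,000 km/s (about two-thirds of c).
FIBER_KM_PER_SEC = 200_000.0

def added_rtt_ms(extra_one_way_km):
    """Extra round-trip time in ms from additional one-way fiber distance."""
    return 2 * extra_one_way_km / FIBER_KM_PER_SEC * 1000

# Illustrative ~2,500 km one-way detour:
extra_ms = added_rtt_ms(2500)
print(round(extra_ms, 1))  # 25.0 (ms of added RTT, at minimum)
```

Even this best-case 25 ms penalty per round trip compounds across the many round trips (TCP and TLS handshakes, multiple HTTP requests) needed to load a page, which is why out-of-region routing shows up as noticeably slower page loads.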
Recovery Begins
At approximately 19:25 UTC, Azure announced the completion of the rollback of a configuration change identified as the trigger of the issue. Around the same time, LinkedIn services began to show signs of recovery, although routing did not initially return to the paths seen before the outage. For example, for the Los Angeles location observed above, instead of routing to an IP address within Microsoft's AS 8068 range, the destination changed to an address within LinkedIn's AS 14413 IP range.
Following the restoration of connectivity and access to LinkedIn at approximately 19:40 UTC, ThousandEyes observed an increase in page load times and HTTP 429 errors in various regions. Users were able to interact with LinkedIn services; however, performance would have been suboptimal.
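HTTP 429 ("Too Many Requests") is the standard signal that a server is rate limiting and the client should slow down. As a hedged sketch of how a client might respond (the function name and default numbers here are illustrative, not LinkedIn's actual behavior): honor a `Retry-After` header when the server provides one, otherwise back off exponentially with a cap:

```python
# Sketch of client-side handling for HTTP 429 responses: prefer the
# server-supplied Retry-After value; otherwise use capped exponential
# backoff. Optional jitter desynchronizes retries across many clients.
def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, jitter=None):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the Retry-After header value in seconds, if present;
    `jitter` is an optional callable like random.uniform(lo, hi).
    """
    if retry_after is not None:
        return float(retry_after)             # server knows best
    delay = min(cap, base * (2 ** attempt))   # exponential growth, capped
    if jitter is not None:
        delay = jitter(0, delay)              # full jitter
    return delay

print(backoff_delay(0))                 # 0.5
print(backoff_delay(3))                 # 4.0
print(backoff_delay(10))                # 30.0 (hits the cap)
print(backoff_delay(2, retry_after=7))  # 7.0 (server override)
```

During a recovery like this one, well-behaved retry logic matters: clients that hammer a rate-limited service with immediate retries can prolong the very congestion producing the 429s.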
ThousandEyes observed traffic being routed through LinkedIn AS 14413 until approximately 22:30 UTC, when traffic reverted to Microsoft AS 8068 (its original pre-outage state).
Lessons and Takeaways
In cases like this, when digital experiences degrade, it's crucial to discern which services or functions are affected, where the issues are arising, and who is impacted. This comprehensive awareness lays a strong foundation for making well-informed decisions about addressing the problems and optimizing for the future. Intermittent issues can be especially time-consuming to identify and resolve, particularly when they involve multiple types of symptoms. It's important to consider these symptoms collectively rather than in isolation to prevent a misdiagnosis.
When it comes to ensuring reliability and maintaining business continuity, redundancy is essential. However, the reality is that not all outages can be avoided, so it's important not to become complacent, even with full planning and redundancy options in place. Automated redundancy cannot always be relied upon to mitigate issues, especially when the failure is concentrated in a single shared resource that redundant paths depend on. Once it's accepted that outages will occur, a set of processes or plans—built around an impact assessment that tells you whether the outage is affecting all users or only a subset, and whether it is affecting a particular region or function within the application—will help prioritize the response effort.
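The impact assessment described above can be reduced to a toy scoping function. This is purely illustrative (the region names, probe data, and 90% health threshold are invented for the example), but it captures the two questions the text raises: is the outage global or partial, and which regions are affected?

```python
# Toy impact-assessment sketch: given per-region probe results, summarize
# whether an outage is global or partial and which regions are affected.
# The 0.9 success-rate threshold is an arbitrary illustrative cutoff.
def assess_impact(results):
    """`results` maps region name -> list of booleans (probe successes)."""
    affected = [region for region, probes in results.items()
                if probes and sum(probes) / len(probes) < 0.9]
    if len(affected) == len(results):
        scope = "global"
    elif affected:
        scope = "partial"
    else:
        scope = "none"
    return {"scope": scope, "affected_regions": sorted(affected)}

report = assess_impact({
    "us-west": [True, False, False, True],   # 50% success -> affected
    "us-east": [True, True, True, True],     # healthy
    "eu-west": [False, False, True, False],  # 25% success -> affected
})
print(report)  # {'scope': 'partial', 'affected_regions': ['eu-west', 'us-west']}
```

A real assessment would draw on many more dimensions (function-level health, DNS vs. network vs. application symptoms), but even this coarse scoping is enough to decide whether to fail over a region or escalate a platform-wide incident.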
[August 5, 2024, 1:30 PM PT]
ThousandEyes can confirm that starting at about 18:25 UTC, some global users attempting to access the LinkedIn app were unable to reach the service due to elevated traffic loss in Microsoft’s network and DNS resolution timeouts. As of 19:30 UTC, the issue appeared to resolve for some impacted users. ThousandEyes will continue to track this incident and will provide updates as we learn more.