ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis of the LinkedIn outage on August 5, 2024, is based on our extensive monitoring, as well as ThousandEyes' global outage detection service, Internet Insights. See how the outage unfolded in this analysis; more updates will be added as we have them.
Outage Analysis
On August 5th, Microsoft experienced an incident that impacted the availability of LinkedIn for some users around the globe. The outage was first observed around 18:25 UTC and manifested as elevated packet loss in Microsoft’s network, as well as DNS resolution timeouts and HTTP errors.
The resultant disruption to LinkedIn lasted a little over an hour. LinkedIn confirmed in a status update that users were able to reconnect to its service by approximately 19:40 UTC. ThousandEyes observed some residual network latency issues after the reported resolution; however, they did not appear to prevent users from interacting with LinkedIn services, and the issues eventually resolved around 22:30 UTC.
Explore an interactive view of the outage as seen in the ThousandEyes platform (no login required).
The incident also impacted other Microsoft services, including Microsoft Teams and Microsoft 365. Microsoft issued a statement indicating that a configuration change to Azure Front Door (AFD) resulted in a disruption to some of its own applications leveraging AFD as their content delivery network (LinkedIn is a Microsoft-owned application). No external commercial customers of Azure Front Door appeared to have been impacted.
The Outage Begins: Packet Loss and Connectivity Disruptions
As the incident began to unfold, ThousandEyes detected elevated packet loss within Microsoft’s network (see figure 3) that impacted the reachability of application servers, as well as essential services like SSL and DNS, leading to timeouts and error messages for users trying to access LinkedIn.
The connectivity and timeout issues were intermittent and impacted users unevenly. DNS issues, in particular, appeared to affect only a subset of the impacted LinkedIn users. One possible reason the DNS timeouts did not affect all users equally is that LinkedIn leverages two authoritative DNS providers, Microsoft and NS1. Users whose queries reached NS1 nameservers during the incident would not have been impacted by the connectivity issues in Microsoft's network; however, once they received a DNS response, those same users could still have been impacted by the issues affecting Azure Front Door.
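The benefit of dual authoritative providers can be illustrated with a minimal sketch. Everything below is hypothetical (the lookup functions are stubs, and the address is from a documentation range); it shows only the retry logic, not how real resolvers actually behave:

```python
# Hypothetical illustration of why dual authoritative DNS providers blunt
# (but don't eliminate) a provider-side outage: a resolver that times out
# against one provider's nameservers can retry the other provider's.
def resolve(hostname, providers):
    """Try each provider's nameservers in turn; return the first answer.

    `providers` maps a provider name to a lookup callable that returns an
    IP string or raises TimeoutError. Returns (provider, ip) on success.
    """
    failed = []
    for name, lookup in providers.items():
        try:
            return name, lookup(hostname)
        except TimeoutError:
            failed.append(name)
    raise TimeoutError(f"all providers timed out: {failed}")

# Simulated nameservers: the Microsoft-hosted ones time out during the
# incident, while the NS1-hosted ones still answer.
def microsoft_ns(host):
    raise TimeoutError("no response")

def ns1_ns(host):
    return "198.51.100.10"  # documentation-range address, not a real one

provider, ip = resolve("www.linkedin.com",
                       {"microsoft": microsoft_ns, "ns1": ns1_ns})
print(provider, ip)  # ns1 198.51.100.10
```

Even in this toy model, a successful NS1 answer only gets the user past name resolution; the connection itself could still fail at Azure Front Door, matching what was observed.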
The Outage Continues: Suboptimal Routing Adds to Reachability Issues
According to a blog post from LinkedIn, the company utilizes Azure Front Door (AFD) for its CDN infrastructure. AFD plays a critical role in ensuring users connect to LinkedIn via servers close to where they are located. For example, before the incident, a client in Los Angeles, CA, was being routed to a server within Microsoft’s network in Los Angeles (see figure 6).
As the outage continued, routing anomalies started to emerge, with users being routed to servers outside of their region. While this rerouting did not always coincide with packet loss, it did add extra distance, resulting in increased latency and more frequent connection timeouts.
Routing was also unstable during the outage. For example, this same Los Angeles location was also routed to an edge server in Iowa in the US, where traffic was dropped before reaching the LinkedIn service.
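The latency cost of an out-of-region detour can be estimated with simple back-of-envelope math. The sketch below assumes light in fiber travels at roughly two-thirds the speed of light (~200,000 km/s) and an illustrative 2,500 km one-way detour; it ignores queuing and routing overhead, so real added latency would be higher:

```python
# Lower-bound estimate of extra round-trip time when a client is routed
# to a distant edge server (e.g., Los Angeles traffic sent via Iowa).
# Assumes propagation in fiber at ~200,000 km/s (about two-thirds of c).
FIBER_KM_PER_SEC = 200_000.0

def added_rtt_ms(extra_one_way_km):
    """Extra round-trip time in ms from additional one-way fiber distance."""
    return 2 * extra_one_way_km / FIBER_KM_PER_SEC * 1000

# Illustrative ~2,500 km one-way detour:
extra_ms = added_rtt_ms(2500)
print(round(extra_ms, 1))  # 25.0 (ms of added RTT, at minimum)
```

Even this best-case 25 ms penalty per round trip compounds across the many round trips (TCP and TLS handshakes, multiple HTTP requests) needed to load a page, which is why out-of-region routing shows up as noticeably slower page loads.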
Recovery Begins
At approximately 19:25 UTC, Azure announced the completion of the rollback of a configuration change identified as the trigger of the issue. Around the same time, LinkedIn services began to show signs of recovery, although routing did not initially return to the paths seen before the outage. For example, for the Los Angeles location observed above, instead of routing to an IP address within Microsoft's AS 8068 range, the destination changed to an address within LinkedIn's AS 14413 IP range.
Following the restoration of connectivity and access to LinkedIn at approximately 19:40 UTC, ThousandEyes observed an increase in page load times and HTTP 429 errors in various regions. Users were able to interact with LinkedIn services; however, performance would have been suboptimal.
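HTTP 429 ("Too Many Requests") is the standard signal that a server is rate limiting and the client should slow down. As a hedged sketch of how a client might respond (the function name and default numbers here are illustrative, not LinkedIn's actual behavior): honor a `Retry-After` header when the server provides one, otherwise back off exponentially with a cap:

```python
# Sketch of client-side handling for HTTP 429 responses: prefer the
# server-supplied Retry-After value; otherwise use capped exponential
# backoff. Optional jitter desynchronizes retries across many clients.
def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, jitter=None):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the Retry-After header value in seconds, if present;
    `jitter` is an optional callable like random.uniform(lo, hi).
    """
    if retry_after is not None:
        return float(retry_after)             # server knows best
    delay = min(cap, base * (2 ** attempt))   # exponential growth, capped
    if jitter is not None:
        delay = jitter(0, delay)              # full jitter
    return delay

print(backoff_delay(0))                 # 0.5
print(backoff_delay(3))                 # 4.0
print(backoff_delay(10))                # 30.0 (hits the cap)
print(backoff_delay(2, retry_after=7))  # 7.0 (server override)
```

During a recovery like this one, well-behaved retry logic matters: clients that hammer a rate-limited service with immediate retries can prolong the very congestion producing the 429s.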
ThousandEyes observed traffic being routed through LinkedIn AS 14413 until approximately 22:30 UTC, when traffic reverted to Microsoft AS 8068 (its original pre-outage state).
Lessons and Takeaways
In cases like this, when digital experiences degrade, it's crucial to discern which services or functions are affected, where the issues are arising, and who is impacted. This comprehensive awareness lays a strong foundation for making well-informed decisions about addressing the problems and optimizing for the future. Intermittent issues can be especially time-consuming to identify and resolve, particularly when they involve multiple types of symptoms. It's important to consider these symptoms collectively rather than in isolation to prevent a misdiagnosis.
When it comes to ensuring reliability and maintaining business continuity, redundancy is essential. However, the reality is that not all outages can be avoided, so it's important not to become complacent, even with full planning and redundancy options in place. Automated redundancy cannot always be relied upon to mitigate issues, especially when the failure is concentrated in a single shared resource that redundant paths depend on. Once it's accepted that outages will occur, a set of processes or plans—built around an impact assessment that tells you whether the outage is affecting all users or only a subset, and whether it is affecting a particular region or function within the application—will help prioritize the response effort.
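The impact assessment described above can be reduced to a toy scoping function. This is purely illustrative (the region names, probe data, and 90% health threshold are invented for the example), but it captures the two questions the text raises: is the outage global or partial, and which regions are affected?

```python
# Toy impact-assessment sketch: given per-region probe results, summarize
# whether an outage is global or partial and which regions are affected.
# The 0.9 success-rate threshold is an arbitrary illustrative cutoff.
def assess_impact(results):
    """`results` maps region name -> list of booleans (probe successes)."""
    affected = [region for region, probes in results.items()
                if probes and sum(probes) / len(probes) < 0.9]
    if len(affected) == len(results):
        scope = "global"
    elif affected:
        scope = "partial"
    else:
        scope = "none"
    return {"scope": scope, "affected_regions": sorted(affected)}

report = assess_impact({
    "us-west": [True, False, False, True],   # 50% success -> affected
    "us-east": [True, True, True, True],     # healthy
    "eu-west": [False, False, True, False],  # 25% success -> affected
})
print(report)  # {'scope': 'partial', 'affected_regions': ['eu-west', 'us-west']}
```

A real assessment would draw on many more dimensions (function-level health, DNS vs. network vs. application symptoms), but even this coarse scoping is enough to decide whether to fail over a region or escalate a platform-wide incident.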
[August 5, 2024, 1:30 PM PT]
ThousandEyes can confirm that starting at about 18:25 UTC, some global users attempting to access the LinkedIn app were unable to reach the service due to elevated traffic loss in Microsoft’s network and DNS resolution timeouts. As of 19:30 UTC, the issue appeared to resolve for some impacted users. ThousandEyes will continue to track this incident and will provide updates as we learn more.