
The Internet Report

Troubleshooting Tips & Outages at Zoom, Spotify & More

By Mike Hicks
19 min read

Summary

On April 16, Zoom, Spotify, and SAP Concur all experienced service disruptions. Find out what happened and explore the key troubleshooting best practices that these incidents highlight.


This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.


Internet Outages & Trends

On April 16, three major tech platforms—SAP Concur, Spotify, and Zoom—experienced significant service disruptions, affecting users globally. Although all three incidents had similar impacts in that users were unable to successfully interact with the services, the actual causes differed and were not always obvious. It was only by analyzing multiple signals across the full service delivery chain that we could identify the specific fault domains.

These disruptions serve as helpful reminders of key troubleshooting best practices for IT operations teams—from the importance of understanding your entire service delivery chain and avoiding diagnosing based on a single symptom to the criticality of monitoring public DNS infrastructure, rather than just your own nameservers. Taking the right troubleshooting steps empowers you with the insight your ITOps team needs to implement the most appropriate mitigation plans and reduce the impact on user experience.

Let's examine what happened with SAP Concur, Spotify, and Zoom, and dive deeper into the troubleshooting best practices they raise. We’ll also take a look at an earlier Vanguard UK issue from April 7.

Read on to learn more.


SAP Concur Disruption

On April 16, SAP Concur—a SaaS platform that provides travel, expense, and invoice management services for businesses—experienced a performance disruption. This issue primarily impacted instances hosted in the US2 data center, hindering users' ability to access services related to expense management, travel, invoicing, requests, and imaging.

Screenshot showing the SAP Concur status page highlighting a service disruption impacting US2 on April 16
Figure 1. SAP Concur status page highlighting service disruption impacting US2 on April 16

The disruption occurred in two phases. ThousandEyes first observed the issue around 8:10 PM (UTC). After the problem initially seemed to resolve around 8:42 PM (UTC), services appeared stable and operational for approximately one hour and 15 minutes before exhibiting the same symptoms again. After this second phase of disruptions, all services appeared to recover completely around 10:20 PM (UTC).


Explore the SAP Concur disruption further on the ThousandEyes platform (no login required).

ThousandEyes screenshot showing intermittent disruption impacting SAP Concur in two occurrences over about 2 hours
Figure 2. Intermittent disruption was observed in two occurrences over a period of around two hours

ThousandEyes data indicated that the issue was occurring exclusively in the receive phase of the interaction: connections were established, but the HTTP requests were left waiting for responses that ultimately timed out. The fact that ThousandEyes only observed issues at this stage suggested a problem with the service itself rather than with reaching it.

ThousandEyes screenshot showing that ThousandEyes observed receive errors during the SAP Concur incident
Figure 3. ThousandEyes observed receive errors during the incident
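To reproduce this kind of distinction outside the ThousandEyes platform, a simple probe can separate a failure to connect from a failure to receive a response. Below is a minimal sketch using Python's requests library; the URL and timeout values are placeholders, not SAP Concur's actual endpoints.

```python
# A minimal sketch (not ThousandEyes tooling) of separating connection failures
# from receive-phase timeouts with Python's requests library. The URL is a
# placeholder; timeout values are illustrative.
import requests

URL = "https://example.com/api/health"  # hypothetical endpoint

try:
    # (connect timeout, read timeout): connect covers the TCP/TLS handshake,
    # read covers waiting for the server to send its response.
    resp = requests.get(URL, timeout=(3, 10))
    print(f"HTTP {resp.status_code} in {resp.elapsed.total_seconds():.2f}s")
except requests.exceptions.ConnectTimeout:
    # Could not even reach the service: points toward a network or frontend issue.
    print("Connect timeout: suspect network path or frontend availability")
except requests.exceptions.ReadTimeout:
    # Connected fine, but no response arrived: consistent with a receive-phase
    # failure where the backend never answers.
    print("Read timeout: connection established, but the response never arrived")
```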

After observing the disruptions at the HTTP layer, we then checked whether any network conditions were contributing to the disruption by delaying or dropping requests to SAP Concur services. In this case, network path analysis showed no excessive loss or latency, indicating that there was no correlation between the service disruption and the network paths leading to the SAP Concur services.
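For a quick, coarse check of those network-layer signals, repeated TCP connection timings toward the service frontend can approximate loss and latency, though far less precisely than full path analysis. A rough sketch, with an illustrative target host and sample count:

```python
# A rough sketch of a network-layer sanity check: repeatedly time TCP connects
# to the service's frontend and count failures. Host, port, and sample count
# are illustrative placeholders.
import socket
import time

HOST, PORT, SAMPLES = "www.example.com", 443, 20  # hypothetical target

latencies, failures = [], 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=3):
            latencies.append((time.monotonic() - start) * 1000)
    except OSError:
        failures += 1
    time.sleep(0.5)

if latencies:
    print(f"connects: {len(latencies)}/{SAMPLES}, "
          f"avg {sum(latencies) / len(latencies):.1f} ms, max {max(latencies):.1f} ms")
print(f"failed connects: {failures}")
```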

When diagnosing an issue, after verifying the integrity of connectivity to the service, ITOps teams should work to understand what is causing the problem if it isn't the network. In this case, that involved examining an interaction in which the client attempts to log in and retrieve data, such as generating a sample report. By breaking down the service delivery chain and the associated interactions, ThousandEyes confirmed the functionality of key backend processes, including user verification and data retrieval. This approach helps better identify the fault domain, ruling functions such as authentication, data retrieval, and data posting in or out.
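A simplified sketch of that kind of step-by-step breakdown is shown below: each stage of a hypothetical user journey (sign-in page, authentication, data retrieval) is timed and reported separately, so a failure can be attributed to a specific stage. The endpoints and payloads are placeholders, not SAP Concur's actual APIs.

```python
# A simplified sketch of breaking a user journey into discrete, timed steps.
# Endpoints and payloads are hypothetical.
import time
import requests

BASE = "https://service.example.com"  # placeholder for the SaaS frontend

def timed_step(name, func):
    start = time.monotonic()
    try:
        func()
        status = "ok"
    except requests.RequestException as exc:
        status = f"failed ({exc.__class__.__name__})"
    print(f"{name:<18} {status:<30} {time.monotonic() - start:.2f}s")

session = requests.Session()
timed_step("load sign-in page", lambda: session.get(f"{BASE}/signin", timeout=10).raise_for_status())
timed_step("authenticate", lambda: session.post(f"{BASE}/api/login",
                                                json={"user": "demo", "pass": "demo"},
                                                timeout=10).raise_for_status())
timed_step("retrieve report", lambda: session.get(f"{BASE}/api/reports/sample",
                                                  timeout=20).raise_for_status())
```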

Upon examining this interaction, ThousandEyes observed increases in transaction times that coincided with longer wait times, as well as a correlation with transaction completion errors, where requests failed to complete.

ThousandEyes screenshot showing increases in transaction wait times corresponding to connection loss
Figure 4. Increases in transaction wait times corresponding to connection loss

The incomplete steps in the transaction indicated that the main issue likely lay within the backend services. Various conditions observed during the disruption also suggested backend issues. For instance, after successfully logging into a specific instance, the homepage failed to load and instead reverted to a generic SAP Concur sign-in page, accompanied by an error message stating, "Something went wrong." This situation underscores a couple of key points. First, the services were responding, providing further evidence that the issue wasn’t related to the network. Second, the authentication process appeared to function properly; the redirection to a generic sign-in page along with the error message suggests that the failure occurred after authentication. Otherwise, the transaction would have failed at that point, and the homepage would not have attempted to load.

Screenshot showing users encountered error messages when trying to log in to SAP Concur
Figure 5. Users encountered error messages when trying to log in

The SAP disruption highlights some important troubleshooting best practices for ITOps teams. When addressing a potential issue, the first step is to triage the situation. This involves determining whether a problem truly exists and, if it does, understanding how it is impacting your user base. Once you’ve established that there is indeed an issue, you should break down the user journey to pinpoint the responsible party and the specific fault domain.

However, it's crucial to remember that just because an issue appears at one point in the journey doesn't necessarily mean that it's the cause of the problem. To fully understand the situation, you need to verify the cause and effect. This can be accomplished by examining the entire service delivery chain holistically and considering the context in which the issue arises. By doing so, you can gain a clearer picture and address the root of the problem more effectively.

Spotify Outage

Around 12:20 PM (UTC) on April 16, Spotify users globally experienced service disruptions on web, desktop, and mobile platforms. The issue manifested as timeouts that appeared to render the service virtually unusable. The outage lasted for over three hours, with recovery starting around 2:45 PM (UTC) and the service fully restored by approximately 3:40 PM (UTC).

Users reported issues such as black screens and the inability to play songs, along with limited functionality. While some tracks were accessible on iOS devices, this was only possible if Spotify had cached them prior to the incident. These problems hinted at backend system issues.

During the disruption, ThousandEyes observed delayed page loads and failures to complete the loading process, apparently affecting components responsible for managing data related to artists, albums, tracks, and user-specific information such as playlists and profiles. There was also evidence of backend issues, including server-side errors such as HTTP 504 Gateway Timeout and 502 Bad Gateway responses. However, connectivity to the Spotify frontend appeared to function normally, indicating that the issue was not related to network connectivity.
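As a rule of thumb (not a definitive diagnosis), gateway-class status codes such as 502 and 504 mean an intermediary was reachable but could not get a usable answer from the origin, which is a different fault domain than a connection that never reaches the frontend at all. A small illustrative helper:

```python
# A small, hedged helper illustrating how status codes hint at fault domains.
# The mapping is a rule of thumb, not a rule.
def likely_fault_domain(status_code: int) -> str:
    if status_code in (502, 504):
        return "backend/upstream (gateway reached, origin unresponsive)"
    if status_code == 503:
        return "service overloaded or intentionally shedding load"
    if 500 <= status_code < 600:
        return "application/server-side error"
    if 400 <= status_code < 500:
        return "client request or authentication issue"
    return "no obvious fault from the status code alone"

for code in (200, 401, 502, 504):
    print(code, "->", likely_fault_domain(code))
```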


Explore the Spotify outage further on the ThousandEyes platform (no login required).

ThousandEyes screenshot showing increases in page load time coincided with timeouts and completion issues
Figure 6. Increases in page load time coincided with timeouts and completion issues, indicating backend service issues

The Spotify outage offers valuable insights for ITOps teams aiming to troubleshoot incidents effectively. When assessing data points, it's essential to interpret signals within their context to identify whether issues stem from network infrastructure, applications, backend services, or outside factors. It's crucial to view a single signal, such as a timeout error, as just a symptom rather than concrete evidence of a fault. Therefore, collecting and analyzing multiple signals from both the network and application layers is vital. Additionally, it's important to consider contextual factors, including the timing of any outages, recent changes to the system, user reports, and historical performance data. By piecing together this comprehensive information, teams can enhance their troubleshooting efforts, leading to faster resolutions.
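As a toy illustration of weighing multiple signals rather than a single symptom, the sketch below combines a network reachability probe with an application-layer check before suggesting a fault domain. The target, thresholds, and verdict logic are placeholders.

```python
# A toy illustration of combining signals rather than reacting to one symptom:
# a network probe plus an application check produce a combined verdict.
import socket
import requests

HOST = "open.spotify.com"  # illustrative target
URL = f"https://{HOST}/"

def network_reachable(host: str, port: int = 443) -> bool:
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def app_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=(3, 10)).status_code < 500
    except requests.RequestException:
        return False

net_ok, app_ok = network_reachable(HOST), app_healthy(URL)
if net_ok and not app_ok:
    print("Network reachable but application failing: suspect backend services")
elif not net_ok:
    print("Network unreachable: suspect connectivity or frontend availability")
else:
    print("No fault indicated by these two signals alone")
```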

Zoom Outage

Also on April 16, Zoom experienced a complete global outage lasting approximately two hours from 6:25 PM to 8:12 PM (UTC). DNS resolution failures and timeouts affected all services, with the top-level domain (TLD) nameservers lacking records for zoom.us and, as a result, returning NXDOMAIN errors. All subdomains, including customer "vanity" URLs and even Zoom's status page, were impacted.

ThousandEyes screenshot showing that the Zoom outage impacted users globally, manifesting as DNS errors
Figure 7. Zoom outage impacted users globally, manifesting as DNS errors

The authoritative nameservers for zoom.us, hosted by AWS Route 53, were accessible and appeared to be correctly configured throughout the outage. They successfully returned records for zoom.us when queried directly. However, due to the absence of DNS records at the TLD level, DNS clients weren't directed to the Route 53 authoritative nameservers.
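One way to verify this kind of condition independently is to query the TLD's nameservers directly and compare the result with what the zone's own authoritative servers return. Below is a minimal sketch using the dnspython package; the domain is taken from this incident, but the approach is generic.

```python
# A minimal sketch (Python, dnspython) of checking whether a TLD's nameservers
# still hold the delegation for a zone, independent of the zone's own
# authoritative servers. The domain is illustrative.
import dns.message
import dns.query
import dns.rcode
import dns.resolver

DOMAIN = "zoom.us"
TLD = "us."

# 1. Find the TLD nameservers via the local recursive resolver.
tld_ns = [ns.target.to_text() for ns in dns.resolver.resolve(TLD, "NS")]
tld_ip = dns.resolver.resolve(tld_ns[0], "A")[0].to_text()

# 2. Ask a TLD server for the domain directly. A healthy delegation returns
#    NOERROR with NS records in the authority section; NXDOMAIN here means
#    the TLD no longer knows about the zone, regardless of how the zone's
#    own authoritative servers are behaving.
query = dns.message.make_query(DOMAIN, "NS")
response = dns.query.udp(query, tld_ip, timeout=5)

if response.rcode() == dns.rcode.NXDOMAIN:
    print(f"{TLD} nameserver {tld_ns[0]} returned NXDOMAIN for {DOMAIN}")
else:
    delegation = [rr.to_text() for rrset in response.authority for rr in rrset]
    print(f"Delegation for {DOMAIN}: {delegation or response.answer}")
```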

Zoom confirmed that access to the domain zoom.us was blocked due to a restriction imposed by their DNS registrar. This prevented the domain name from resolving to its respective IP address, rendering the domain inaccessible. After the block was lifted, the service came back online at 8:12 PM (UTC), though some disruption continued until 8:30 PM (UTC) when most users could access Zoom again. For those who continued to experience issues, Zoom recommended flushing their DNS cache and trying to reconnect.

This isn’t the first time we’ve reported on DNS-related issues impacting applications. In May 2024, Salesforce experienced a DNS disruption that impacted some customers trying to reach the service. And in July 2021, an outage impacted Akamai’s DNS service, preventing users around the globe from reaching its customers’ sites.

These cases highlight just how critical DNS is to the applications we rely on. And the Zoom outage also emphasizes the importance of monitoring not only your own nameservers, but the public DNS infrastructure as well.
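A lightweight way to keep an eye on public DNS infrastructure is to spot-check how a domain resolves through several well-known public resolvers, not just your own nameservers. A minimal sketch using dnspython, with an illustrative domain:

```python
# A minimal sketch (dnspython) of spot-checking resolution through public
# resolvers. The domain is illustrative.
import dns.exception
import dns.resolver

DOMAIN = "zoom.us"
PUBLIC_RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}

for name, ip in PUBLIC_RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 5
    try:
        answers = resolver.resolve(DOMAIN, "A")
        print(f"{name} ({ip}): {[a.to_text() for a in answers]}")
    except dns.resolver.NXDOMAIN:
        print(f"{name} ({ip}): NXDOMAIN for {DOMAIN}")
    except (dns.resolver.NoAnswer, dns.resolver.NoNameservers, dns.exception.Timeout) as exc:
        print(f"{name} ({ip}): {exc.__class__.__name__}")
```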


For more insights on the Zoom outage and key takeaways, see ThousandEyes’ full outage analysis blog.

Vanguard Outage

On April 7, Vanguard's investment platform experienced a degradation that seemed to affect U.K.-based customers primarily, although this could have been related to the time of day, as reports of the issues began around 8:00 AM (GMT).

While the Vanguard website remained accessible, users encountered error messages about service unavailability. The problems seemed to affect only specific functions within the service.

Interestingly, the mobile app appeared to be available for checking portfolios and transactions. While it’s not entirely clear if all the same features were available on the mobile app, the fact that users could check accounts suggested the website and app have different backend environments.

While it’s unclear exactly what caused the Vanguard outage, in the financial services space outages are sometimes triggered by external events that prompt many users to log onto the platform at the same time, overwhelming the system; the same can happen in other industries. Organizations should have backup plans in place to help maintain service performance when confronted by an unexpected flood of traffic caused by external factors outside their control. When troubleshooting the cause of an outage, ITOps teams should also look beyond their immediate environment, taking in all relevant data points to determine the cause and take appropriate action to rectify it.

Additionally, it must be noted that some traffic spike events are more predictable: ticket sales for a popular concert, Black Friday sales, major sporting events, or deadlines like Tax Day. To the extent possible, companies should scale up their service before these events, running tests to make sure they can meet the demand.
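As a rough illustration of that kind of pre-event check, the sketch below ramps up concurrent requests against a hypothetical staging endpoint and reports error rates and average latency at each level. A dedicated load-testing tool is the right choice for anything beyond a smoke test.

```python
# A very rough sketch of a pre-event load check: ramp up concurrent requests
# and watch error rate and latency. The URL and ramp levels are placeholders.
from concurrent.futures import ThreadPoolExecutor
import time
import requests

URL = "https://staging.example.com/login"  # hypothetical staging endpoint

def probe(_):
    start = time.monotonic()
    try:
        ok = requests.get(URL, timeout=10).status_code < 500
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start

for concurrency in (10, 50, 100):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(probe, range(concurrency)))
    errors = sum(1 for ok, _ in results if not ok)
    avg = sum(t for _, t in results) / len(results)
    print(f"concurrency {concurrency}: {errors} errors, avg {avg:.2f}s")
```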

Troubleshooting Best Practices

This series of disruptions provides valuable insights for IT professionals. Triage methodology matters. Determining if a problem exists and how it is impacting users should come before diving deeper into technical analysis. Breaking down the journey by examining each step in the service delivery chain helps identify the responsible party and fault domain. Context is crucial when interpreting error signals, as a single error should be viewed as a symptom rather than definitive proof of the underlying cause.

Holistic analysis requires collecting and analyzing multiple signals from both network and application layers while considering the broader context. External events can directly impact system load and normal operations. The Zoom outage demonstrates how fundamental services like DNS are critical, as issues can render entire platforms inaccessible—despite the health of other components. Many outages stem from configuration problems that affect just one component but cascade throughout the system.

Understanding these lessons can help IT teams better respond to service disruptions—whether by switching to alternative networks, implementing mitigation steps, or simply waiting for an outage or degradation to resolve while keeping stakeholders informed.


By the Numbers

Let’s close by taking a look at some of the global trends ThousandEyes observed over recent weeks (April 7-20) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.

  • In a brief reversal of the downtrend observed over the previous fortnight (March 17 - April 6), outages initially rose before declining again. In the first week (April 7-13), ThousandEyes recorded 559 outages, up from 404 the week before, a 38% increase. Outages then resumed their downward trend, falling from 559 to 309 during the week of April 14-20, a 45% decrease from the previous week.

  • The United States followed a similar pattern. Initially, outages increased from 150 to 212, reflecting a 41% rise compared to the previous week. However, like the global trend, U.S. outages also experienced a decrease in the following week. During the week of April 14-20, outages dropped from 212 to 69, representing a 68% decrease.

  • From April 7 to April 20, an average of 32% of all network outages occurred in the United States, down from the 41% observed in the previous period (March 17 - April 6). This 32% figure departs from the longstanding trend in which U.S.-based outages typically account for at least 40% of all recorded outages.

Bar graph showing global and U.S. network outage trends over eight recent weeks, February 24 through April 20
Figure 8. Global and U.S. network outage trends over eight recent weeks

