This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re also featuring a conversation exploring what it takes to deliver great digital experiences in the sports world, with special guest Dave Anderson, a tech industry veteran and co-host of "A Very Melbourne Podcast," which covers the Australian Football League and more. As always, you can read the analysis below or tune in to the podcast for firsthand commentary.
Assuring Great Digital Experiences for Sports Fans
Major sporting events are always logistically complex, but this is even more the case now that digital technology has permeated every part of operations and experience delivery. Venues are highly networked spaces, with everything from ticketing to hospitality services being run and managed digitally. Mobile wayfinding guides ticket holders to their seats; digital signage and displays beam game action to fans; and on-site or mobile production studios bring live feeds to TV and online audiences, domestically and (where licensed) internationally.
This end-to-end complexity, with its multiple dependencies and reliance on Internet infrastructure, can be challenging to oversee and manage. Networks, particularly those that are third-party operated, can be susceptible to a range of operating conditions that can affect the fan experience.
When it comes to live sports, anything that does impact the digital experience is particularly problematic because the events occur in real time. The event, its audience, and the infrastructure that supports the experience delivery—both in person and at home—are all dynamic. Yet there’s only one chance to get it right, and so managing all the variables that contribute to the experience is absolutely critical.
Navigating a Big Year for Sports Events
Fans can have long memories when it comes to content delivery glitches that cause them to miss an important moment, such as a goal or a penalty being awarded. And there continue to be plenty of examples of broadcast issues in the delivery of live sports into people’s homes. For example, a December boxing match experienced audio problems, and TV images didn’t display for part of an English representative soccer match this year.
It’s not just elite-level sports that are impacted by outages. An issue with a grassroots sporting app on game day impacted community sports in Australia earlier this year.
With competition between sports for global audiences continuing to ramp up, the key to delivering for fans is to offer a glitch-free and consistent experience, no matter where the fan is: at a stadium, at home, in a car or airplane, in an office, or out and about. Governing bodies, broadcasters, streamers, and fans all want assurance that they’ll get the best experience every time they engage.
For organizations in the digital experience delivery chain, more than ever it’s about being able to detect and remediate issues as they arise, and to optimize every connected experience.
Tune in to the podcast for more from The Internet Report team and special guest Dave Anderson on assuring great digital experiences in the sports world.
Internet Outages & Trends
Returning to our regular outage programming, ThousandEyes observed two cloud-related incidents over the past few weeks: one where Microsoft ran into issues recovering from a DDoS attack, and another where “an issue with AWS” caused problems globally for cloud accounting software-as-a-service provider Xero. We also saw another recent Microsoft disruption that impacted LinkedIn. In addition, a major market sell-off that was triggered by events in the U.S. and Japan caused problems for some brokerage and trading platforms. We’ll unpack these below.
Microsoft Azure Services Disruption
Azure Front Door (AFD) and Azure Content Delivery Network (CDN), and downstream services that rely on them, were impacted by an outage on July 30 that reportedly started at 11:45 AM (UTC). However, ThousandEyes observed network issues before then, with parts of the Microsoft network seeing degradation between 10:30 and 11:00 AM (UTC).
Explore the Azure disruption further in the ThousandEyes platform, no login required.
According to Microsoft’s official post-incident explanation, the problems began with a DDoS attack that was detected and automatically mitigated. But, once mitigated, default traffic routing did not resume as expected. This was due to a series of failures, beginning with a local power outage at “one specific site in Europe,” which caused traffic to continue to route through DDoS protection services. Complicating matters, “an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions.”
This is consistent with ThousandEyes’ observations. It was clear there were issues in how traffic was being redistributed following the DDoS mitigation, leading to congestion and dropped packets for customers. The issue was resolved by 2:00 PM (UTC), a bit over two hours after the reported 11:45 AM (UTC) start.
The incident illustrates that the cause of a disruption can be multifaceted, with multiple factors—including your own mitigation efforts—potentially playing a contributing role. When a disruption occurs, it's crucial to make sure any actions taken to remediate the issue are working as expected and not inadvertently making the issue worse.
Xero Outage
On the same day as the Azure disruption, cloud accounting software provider Xero experienced a six-hour issue that prevented some customers from logging in or navigating the app. The outage resulted in a bad gateway (HTTP 502) error, indicating there was a problem with the communication between the CDN/proxy and backend systems. This type of error is classified as a server-side error and is usually observed when there are issues with receiving a response from the backend systems. In this instance, the backend systems were hosted on AWS. Xero has confirmed in status updates that the problems were "related to an issue with AWS."
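As a quick illustration of that classification, here is a minimal Python sketch that maps a status code to its broad class using the standard library's `http.HTTPStatus`. The helper name `status_class` is our own and purely illustrative; it is not part of any ThousandEyes or Xero tooling.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Return the broad class of an HTTP status code (hypothetical helper)."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not a valid HTTP status code: {code}")


if __name__ == "__main__":
    status = HTTPStatus(502)
    # A 502 falls in the 5xx range, so it is reported as a server-side error.
    print(status.value, status.phrase, "->", status_class(status))
```

A 502 tells you that a gateway or proxy received an invalid response from an upstream server; it does not by itself say which upstream component failed.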
While ThousandEyes observed impacts on users globally, it was not a total outage, although that does not necessarily provide any relief to those affected. This suggests that AWS as a whole was not down; rather, the issue was with a specific AWS service that Xero relies on for some services and functions.
Explore the Xero outage further in the ThousandEyes platform (no login required).
During the issue, ThousandEyes observed HTTP and Receive errors, suggesting that this was not a network issue and that the domain itself was reachable. Page load time also increased and only two web components loaded, which indicates that edge servers were reachable and responsive but unable to load all required components. Together, these observations reinforced our view that the issue was with the backend.
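The reasoning here can be sketched as a simple triage rule. The Python below is only an illustration of that line of thinking, not how ThousandEyes classifies outages: the `Observation` fields, the 5% packet loss threshold, and the sample values (other than the 502 status and the two loaded components) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical summary of one test round; field names are assumptions."""
    dns_resolved: bool
    packet_loss_pct: float
    http_status: Optional[int]
    components_loaded: int
    components_expected: int


def likely_fault_domain(obs: Observation, loss_threshold_pct: float = 5.0) -> str:
    """Rough guess at where a problem sits. The threshold is arbitrary."""
    if not obs.dns_resolved:
        return "dns"
    if obs.packet_loss_pct >= loss_threshold_pct:
        return "network"
    # The domain is reachable, so look at what the application returned.
    if obs.http_status is not None and 500 <= obs.http_status < 600:
        return "backend"
    if obs.components_loaded < obs.components_expected:
        return "backend"
    return "healthy"


if __name__ == "__main__":
    # Illustrative values only: clean network path, 502 response, and
    # 2 components loaded out of a made-up total of 40.
    xero_like = Observation(
        dns_resolved=True,
        packet_loss_pct=0.0,
        http_status=502,
        components_loaded=2,
        components_expected=40,
    )
    print(likely_fault_domain(xero_like))
```

In practice, a single rule like this is only a starting point; correlating several signals over time is what builds confidence in the diagnosis.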
On a side note, ThousandEyes also observed that some services appeared to operate normally, suggesting that functionality depended on where and how the user accessed the AWS region/network.
Microsoft Incident and LinkedIn Disruption
On August 5, some LinkedIn users around the globe encountered issues with the platform when Microsoft experienced an incident that impacted LinkedIn’s availability. First observed around 6:25 PM (UTC), the outage manifested as elevated packet loss in Microsoft’s network, as well as DNS resolution timeouts and HTTP errors. The incident also appeared to impact some of Microsoft’s other services, including Microsoft Teams and Microsoft 365.
The disruption to LinkedIn lasted for just over an hour. In a status update, LinkedIn confirmed that users were able to reconnect to its service by approximately 7:40 PM (UTC). ThousandEyes observed some lingering network latency issues after the reported resolution. However, these issues did not appear to prevent users from interacting with LinkedIn services, and they eventually resolved around 10:30 PM (UTC).
For further insights about this incident, see this dedicated outage analysis blog.
Brokerage, Trading Platform Issues
A number of online trading platforms used by retail investors experienced issues on August 5, coinciding with a major stock sell-off across global markets.
Charles Schwab confirmed it was among the operators to have problems. “A technical issue experienced by some clients has been resolved,” it said on X. “We apologize for the inconvenience.” Vanguard and Fidelity Investments were also reportedly impacted, with regulators observing proceedings.
Large events on financial markets have always had the potential to impact trading platforms. That goes for regular stocks as well as for cryptocurrencies, as seen during the height of interest in digital currencies a few years ago. The financial sector is inherently complex and its digital operations are no exception; it’s worth taking note of recent guidance in this space aimed at optimizing customers’ digital experiences and mitigating the effects of disruptions.
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (July 22 - August 4):
- The upward trend observed across July continued into the first week of this period (July 22-28), with outages increasing by 9% compared to the previous week, rising from 187 to 204. However, this upward trend came to an end the following week, with outages decreasing from 204 to 183 between July 29 and August 4, marking a 10% decrease compared to the previous week.
- The United States did not reflect this trend. Instead, the increases observed throughout much of July ended in the first week of this period (July 22-28), with outage numbers decreasing 34%. However, the upward trend resumed the next week, with outages rising 28% from July 29 to August 4.
- Despite this rise in outages in the United States during the second week of the period, U.S.-centric outages made up less than 40% of all observed global outages. From July 22 to August 4, only 35% of network outages occurred in the United States, compared to 48% in the preceding two weeks (July 8 to 21).
- Looking at the month-over-month data, in July, 816 outages were observed worldwide, an 8% decrease from the 890 reported in June. However, there was a slight increase in outages in the United States, rising from 308 in June to 334 in July, marking an 8% increase.
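For readers who want to check the arithmetic, the percentages above follow from a standard percent-change calculation. The short Python sketch below reproduces it using only the outage counts quoted in this section; the function name and the labels are our own.

```python
def percent_change(old: int, new: int) -> float:
    """Percent change from old to new, relative to old."""
    return (new - old) / old * 100


# Outage counts quoted in this section (labels are ours).
figures = {
    "Global, week of July 22-28": (187, 204),
    "Global, week of July 29 - August 4": (204, 183),
    "Global, June to July": (890, 816),
    "United States, June to July": (308, 334),
}

if __name__ == "__main__":
    for label, (old, new) in figures.items():
        change = percent_change(old, new)
        print(f"{label}: {old} -> {new} ({change:+.1f}%)")
```

Rounded to whole numbers, these work out to the 9% increase, 10% decrease, 8% decrease, and 8% increase reported above.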