This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.
Internet Outages & Trends
During high-traffic seasons like Black Friday or the launch of a much-anticipated product, maintaining quality digital experiences for customers is critical. We’ve all heard stories of websites that crashed during a major sale, leaving online shoppers unable to complete their purchases. To avoid a breakdown like this, companies often turn to traffic management strategies, such as digital waiting rooms, during high-traffic periods.
Replicating an approach that physical stores sometimes use to limit foot traffic during busy periods, digital waiting rooms have shoppers bide their time in a virtual waiting space before being admitted to the site. This reduces the amount of traffic the site has to support at any one time, and it allows a gradual ramp to hopefully avoid overloading any infrastructure components involved.
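To make the mechanism concrete, here’s a minimal sketch of how a waiting room queue might work, assuming a fixed admission rate and a cap on concurrent sessions. The names, rates, and thresholds are hypothetical and aren’t tied to any specific vendor’s implementation.

```python
from collections import deque

# Hypothetical sketch of a digital waiting room: visitors join a queue and are
# admitted to the site at a bounded rate, capping concurrent load on the backend.
ADMIT_PER_SECOND = 50        # assumed admission rate
MAX_ACTIVE_SESSIONS = 5000   # assumed cap on concurrent shoppers

waiting = deque()            # visitor IDs waiting for admission
active = set()               # visitors currently allowed on the site

def join_waiting_room(visitor_id: str) -> int:
    """Place a visitor in the queue and return their position."""
    waiting.append(visitor_id)
    return len(waiting)

def admit_batch() -> list[str]:
    """Called once per second: admit up to ADMIT_PER_SECOND visitors,
    as long as the active-session cap isn't exceeded."""
    admitted = []
    while waiting and len(admitted) < ADMIT_PER_SECOND and len(active) < MAX_ACTIVE_SESSIONS:
        visitor = waiting.popleft()
        active.add(visitor)
        admitted.append(visitor)
    return admitted

def leave_site(visitor_id: str) -> None:
    """Free a slot when a shopper completes checkout or abandons their session."""
    active.discard(visitor_id)
```

The key design choice is that the site’s backend only ever sees the admitted population, so demand spikes translate into longer queues rather than overload.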
This approach has both pros and cons. On the one hand, some users’ experiences may be less immediate than they’d expected. On the other hand, they’re still able to complete their purchase rather than being confronted by a totally broken process. And a slower experience for some hopefully maximizes the number of people able to have a positive experience, instead of the process stalling entirely for everyone.
Online ticketing platforms offer a prime example of virtual waiting rooms and other traffic management strategies in action. Ticketing websites can see enormous fluctuations in traffic levels, depending on the events they are selling tickets for. During the recent Oasis reunion tour ticket sales, ThousandEyes observed different approaches to managing the traffic between various ticketing platforms, illustrating the tradeoffs ITOps teams have to make as they try to maintain the best possible digital experience for as many users as possible during high-traffic seasons.
And it wasn’t just ticketing platforms that had to deal with anomalous traffic patterns in recent weeks. Users experienced issues accessing Microsoft’s cloud services when attempting to route traffic over AT&T peering infrastructure, while a route leak, performance incident, and even a fire also caused problems for ISPs, cloud-based application makers, and cloud service providers, respectively.
Read on to learn about all these disruptions, as well as current traffic management strategies, or jump ahead to the sections that most interest you.
Oasis Reunion Ticket Issues
For many events, ticket demand outstrips supply, so traffic management techniques are used to triage the sudden influx of would-be purchasers. Like other eagerly anticipated shows before it, online ticket sales for the Oasis reunion tour were likely to trigger some of these traffic mechanisms. What was interesting is how each of the ticket-selling platforms employed traffic management techniques and processes to control the flow of traffic through their systems.
ThousandEyes observed load building up long before the official ticket sales opening at 8:00 AM (UTC) / 9:00 AM (BST) on August 31, which manifested as an increase in response times across all ticket agencies from around 6:45 AM (UTC).
The official ticketing services selling Oasis tickets—Ticketmaster, See Tickets, and Gigs and Tours—each operate a form of a “digital waiting room” that includes a purchasing queue. See Tickets also appeared to implement rate-limiting measures during the Oasis sale.
When tickets went on sale at 8:00 AM (UTC), ThousandEyes immediately observed some requests returning HTTP 503 service unavailable messages, which coincided with a slight drop in the number of objects that loaded on the page. Around 15 minutes after ticket sales officially opened, ThousandEyes started to see some requests that appeared to have succeeded, rather than timing out or returning an error code. This pattern continued through the sale period, with increased page load and wait times mixed with occasional timeouts.
Ticketmaster’s Approach
Ticketmaster appeared to be leveraging Fastly as its CDN. ThousandEyes observed good connectivity up to the CDN, but then saw intermittent timeouts and 503 “maximum connection” backend errors before Ticketmaster sold out of its ticket allocation, which indicates a slow or unresponsive backend. These 503 errors are typically server-side and suggest that network connectivity and reachability are good; however, they indicate that the maximum number of concurrent connections to the origin server may have been reached.
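For context on how this class of error is typically handled (a generic illustration, not Ticketmaster’s or Fastly’s actual configuration), a client or downstream service usually treats a 503 from a saturated origin as retryable and backs off before trying again. The URL and retry parameters below are hypothetical.

```python
import time
import requests

# Hypothetical client-side handling of 503 "maximum connection" responses:
# back off and retry rather than hammering an already-saturated origin.
CHECKOUT_URL = "https://example-tickets.test/api/checkout"  # placeholder URL

def post_with_backoff(payload: dict, max_attempts: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(CHECKOUT_URL, json=payload, timeout=10)
        except requests.exceptions.Timeout:
            resp = None  # backend slow or unresponsive; treat like a retryable failure
        if resp is not None and resp.status_code != 503:
            return resp  # success or a non-retryable error; let the caller decide
        time.sleep(delay)
        delay *= 2  # exponential backoff reduces pressure on the saturated backend
    raise RuntimeError("origin still unavailable after retries")
```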
Any Oasis fan who encountered the 503 error would likely have seen their transaction disrupted, as calls and requests were being made from the front page to backend services, where the errors were occurring.
As the sales period progressed, requests were redirected to a waiting room until access could be granted.
Explore Ticketmaster’s performance further through the ThousandEyes platform (no login required).
See Tickets’ Approach
Other authorized ticket sellers like See Tickets and Gigs and Tours also operate a similar digital waiting room. With See Tickets, in particular, ThousandEyes observed disruption with service unavailable messages and HTTP 429 “Too Many Requests” errors, which suggest that rate-limiting was being applied to the domain.
Dive deeper into See Tickets’ performance through the ThousandEyes platform (no login required).
Page load time and wait time also increased prior to the opening of sales, likely the result of people trying to get in the queue early. By 7:25 AM (UTC), 35 minutes before opening, ThousandEyes observed that page load time was continuing to increase and that HTTP 500 internal server errors had begun to appear.
Once tickets went on sale, ThousandEyes observed HTTP 429 errors, which typically indicate that too many requests are being made and that a rate-limiting process is in place to control the traffic flow.
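As a rough sketch of how this kind of rate limiting can be implemented on the server side (not necessarily what See Tickets does), a per-client token bucket refuses excess requests with a 429 and a Retry-After hint. The refill rate and bucket size below are assumptions.

```python
import time

# Hypothetical sketch of rate limiting that yields HTTP 429 responses:
# a token bucket per client that refills at a fixed rate.
REFILL_RATE = 5.0      # assumed requests allowed per second
BUCKET_CAPACITY = 10   # assumed burst size

buckets: dict[str, tuple[float, float]] = {}  # client_id -> (tokens, last_seen_time)

def check_request(client_id: str) -> tuple[int, dict]:
    """Return an (HTTP status code, response headers) pair for an incoming request."""
    now = time.monotonic()
    tokens, last = buckets.get(client_id, (BUCKET_CAPACITY, now))
    tokens = min(BUCKET_CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1:
        buckets[client_id] = (tokens - 1, now)
        return 200, {}
    buckets[client_id] = (tokens, now)
    retry_after = (1 - tokens) / REFILL_RATE  # seconds until a token is available
    return 429, {"Retry-After": str(max(1, round(retry_after)))}
```

A well-behaved client honors the Retry-After header, which keeps the site reachable and online even under excessive load, matching the “busy” behavior described above.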
Users were also presented with busy and error messages while trying to load the page, indicating that the site was still reachable and online, but just experiencing excessive load.
The traffic management techniques used to handle this temporary increase in website traffic demonstrate the compromises ITOps teams sometimes make as they try to ensure the best possible digital experience for as many users as possible during peak traffic periods. The placement of traffic management in the purchasing process likely affected the user experience in different ways. Some users may have experienced the biggest delays right at the beginning, while trying to access the purchasing process. Others who were able to proceed may have encountered delays at different stages of the purchasing flow, such as seat selection, data entry, or payment.
Microsoft Outage
At approximately 11:40 AM (UTC) on September 12, some users experienced issues when attempting to reach Microsoft services, such as Microsoft 365. ThousandEyes observed significant packet loss, as well as connection timeouts within Microsoft’s network during the incident. From ThousandEyes’ perspective, the problems were limited to a subset of users connecting to Microsoft’s network directly from or through the AT&T peering point.
The incident was resolved by approximately 1:20 PM (UTC). The timeline of events that ThousandEyes observed closely aligns with Microsoft’s official status notifications, which reported a duration of 11:46 AM to 1:14 PM (UTC).
Explore the Microsoft outage further in the ThousandEyes platform (no login required).
Microsoft suggested that the root cause was an unspecified change made by AT&T that was then rolled back. AT&T later confirmed a “brief disruption connecting to some Microsoft services.”
During the incident, ThousandEyes observed high forwarding loss in the Microsoft network, which affected various Microsoft services. The incident appeared to have a mostly regional impact, affecting AT&T customers in the U.S. or users whose ISPs routed through the AT&T peering network. Interestingly, Azure-hosted services seemed to be unaffected, even those utilizing the affected AT&T-Microsoft peering point.
Akamai Outage
On September 5, multiple broadband providers in the U.K. encountered issues reaching some Akamai services. The packet loss coincided with the appearance of Autonomous System (AS) 7473 (Singtel) in the path. It appears that the introduction of the AS into the inter-European route redirected traffic between European destinations through Singapore, which caused unexpected problems due to the unanticipated traffic path or additional load.
Explore the Akamai outage further in the ThousandEyes platform (no login required).
The cause of this incident appeared to be a BGP route leak, which briefly altered traffic patterns to the Akamai services. Examining BGP routing during that time, ThousandEyes observed path changes caused by what appeared to be a service provider erroneously inserting itself into the path. The Internet Engineering Task Force (IETF) defines a BGP route leak, in RFC 7908, as the unauthorized spread of routing announcements beyond their intended scope. In other words, it occurs when an AS mistakenly propagates a learned BGP route to another AS, violating the policies of the intended receiver, sender, and/or one of the ASes along the preceding AS path. This can lead to the redirection of traffic through unintended paths.
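One simple way to spot this kind of anomaly is to compare the AS paths observed for a prefix against the transit providers you expect to see. The sketch below is purely illustrative; the ASNs and prefix are placeholders, not the actual routing data from this incident.

```python
# Hypothetical sketch of flagging a possible route leak: compare observed BGP
# AS paths for a prefix against the transit ASes you expect to appear.
EXPECTED_TRANSIT_ASNS = {2914, 3356, 1299}   # illustrative expected transit ASes
MONITORED_PREFIX = "203.0.113.0/24"          # documentation prefix, placeholder
ORIGIN_ASN = 64500                            # placeholder origin ASN

def find_unexpected_asns(as_path: list[int], origin_asn: int) -> set[int]:
    """Return any AS in the path that is neither the origin nor an expected transit."""
    return {asn for asn in as_path if asn != origin_asn and asn not in EXPECTED_TRANSIT_ASNS}

# Example: a path where AS 7473 appears between European networks would be flagged.
observed_paths = [
    [3356, 7473, 64500],   # unexpected AS in the middle of the path
    [1299, 64500],         # expected path, nothing flagged
]
for path in observed_paths:
    leaked = find_unexpected_asns(path, ORIGIN_ASN)
    if leaked:
        print(f"{MONITORED_PREFIX}: possible leak via AS {leaked} in path {path}")
```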
Alibaba Cloud Outage
At 2:20 AM (UTC) on September 10, Alibaba Cloud detected problems in availability zone C of its Singapore region, which manifested as “abnormal” operations of cloud services. Not long after, the cause became clear: an “explosion of lithium batteries in the Singapore data center, which led to fire and elevated temperature.” Customers were urged to “migrate production workloads” to other functioning infrastructure “as soon as possible.”
The impacted facility is operated by Digital Realty, which said there were no injuries and little structural damage to the building. It did not, however, address the state of the hosted equipment, which can be adversely affected by heat, as well as by the gas suppression systems used to fight fires.
For Alibaba Cloud, restoration activities were still occurring a week later. The cloud services provider noted that “some of the affected hardware and machineries are located in the dangerous and blocked area of the building where access is not allowed, and some hardware and machineries require to be carefully dried in order to ensure data security. The restoration of some long tail machines and inventories may take a longer time.”
Cloudflare Incident
On September 16, starting at approximately 4:05 AM (UTC), ThousandEyes observed Cloudflare experiencing a performance incident that led to connectivity issues in multiple locations leveraging its CDN and networking services. The issue manifested as connection and service timeouts, indicating a problem with backend connectivity and routing within the Cloudflare environment. ThousandEyes observed impacts on applications such as Zoom and HubSpot. The incident lasted around two hours and was resolved at approximately 6:00 AM (UTC).
Explore the Cloudflare incident further in the ThousandEyes platform (no login required).
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over two recent weeks (September 2-15):
- The downward trend observed since mid-August continued into September. Over the period of September 2-15, the total number of global outages decreased. There was a 9% drop in the first week, with outages falling from 191 to 174. This trend continued into the following week, with outages decreasing slightly from 174 to 170 between September 9 and 15, a 2% decrease compared to the previous week.
- The United States deviated from this pattern. Initially, outages increased by 13% during the first week of the period (September 2-8). However, in the following week (September 9-15), they decreased by 1%.
- During the period of September 2 to 15, more than 50% of network outages occurred in the United States, compared to 45% in the previous two weeks (August 26 to September 1). This pattern has been fairly consistent throughout 2024, with U.S.-centric outages often accounting for at least 40% of all observed outages.