The Internet Report

Managing Traffic During Peak Demand; Plus, Microsoft, Akamai Outages

By Mike Hicks | 18 min read

Summary

The recent Oasis reunion tour ticket sales process revealed a lot about current techniques for web traffic management. Learn more about this, and also explore traffic-related outages at Microsoft and Akamai.


This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.


Internet Outages & Trends

During high-traffic seasons like Black Friday or the launch of a much-anticipated product, maintaining quality digital experiences for customers is critical. We’ve all heard stories of websites that crashed during a major sale—leaving online shoppers unable to make their purchases. To avoid a complete breakdown like this, companies sometimes use various traffic management strategies, like digital waiting rooms, during high-traffic periods. 

Replicating an approach that physical stores sometimes use to limit foot traffic during busy periods, digital waiting rooms have shoppers bide their time in a virtual waiting space before being admitted to the site. This reduces the amount of traffic the site has to support at any one time, and it allows a gradual ramp to hopefully avoid overloading any infrastructure components involved. 
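To illustrate the idea, here is a minimal Python sketch of a virtual waiting room. The WaitingRoom class and the admission rate are hypothetical; this is purely conceptual and not how any particular ticketing platform implements its queue.

```python
from collections import deque

class WaitingRoom:
    """Conceptual sketch of a virtual waiting room (hypothetical class, not any
    vendor's implementation): visitors join a FIFO queue and are admitted at a
    fixed rate, so the origin never sees the full burst of demand at once."""

    def __init__(self, admits_per_interval: int):
        self.admits_per_interval = admits_per_interval
        self.queue = deque()

    def join(self, visitor_id: str) -> int:
        """Add a visitor to the queue and return their queue position."""
        self.queue.append(visitor_id)
        return len(self.queue)

    def admit_batch(self) -> list:
        """Called once per interval (e.g., every second): release the next batch."""
        batch = []
        while self.queue and len(batch) < self.admits_per_interval:
            batch.append(self.queue.popleft())
        return batch

# 10,000 visitors arrive at once, but only 50 sessions are admitted per interval.
room = WaitingRoom(admits_per_interval=50)
for i in range(10_000):
    room.join(f"visitor-{i}")

admitted = room.admit_batch()
print(len(admitted), "admitted;", len(room.queue), "still waiting")
```

The key design choice is that the admission rate, not raw demand, determines the load the downstream purchasing systems see at any moment.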

This approach has both pros and cons. On the one hand, some users’ experiences may be less immediate than they’d expected. However, in the end, they’re still able to complete their purchase, rather than being confronted by a totally broken process. And a slower experience for some hopefully maximizes the number of people able to have a positive experience throughout, instead of the process being fully stalled for everyone.

Online ticketing platforms offer a prime example of virtual waiting rooms and other traffic management strategies in action. Ticketing websites can see enormous fluctuations in traffic levels, depending on the events they are selling tickets for. During the recent Oasis reunion tour ticket sales, ThousandEyes observed the various ticketing platforms take different approaches to managing the surge in traffic, illustrating the tradeoffs ITOps teams have to make as they try to maintain the best possible digital experience for as many users as possible during high-traffic periods.

And it wasn’t just ticketing platforms that had to deal with anomalous traffic patterns in recent weeks. Users experienced issues accessing Microsoft’s cloud services when attempting to route traffic over AT&T peering infrastructure, while a route leak, performance incident, and even a fire also caused problems for ISPs, cloud-based application makers, and cloud service providers, respectively.

Read on to learn about all these disruptions, as well as current traffic management strategies.


Oasis Reunion Ticket Issues

For many events, ticket demand outstrips supply, so technical traffic management techniques are used to triage the sudden influx of would-be purchasers. As with other eagerly anticipated shows before it, the Oasis reunion tour’s online ticket sale was always likely to trigger some of these mechanisms. What was interesting was how each of the ticket-selling platforms employed traffic management techniques and processes to control the flow of traffic through their systems.

ThousandEyes observed load building up long before the official ticket sales opening at 8:00 AM (UTC) / 9:00 AM (BST) on August 31, which manifested as an increase in response times across all ticket agencies from around 6:45 AM (UTC).

Figure 1. ThousandEyes saw increased response times a few hours before ticket sales opened

The official ticketing services selling Oasis tickets—Ticketmaster, See Tickets, and Gigs and Tours—each operate a form of a “digital waiting room” that includes a purchasing queue. See Tickets also appeared to implement rate-limiting measures during the Oasis sale.

When tickets went on sale at 8:00 AM (UTC), ThousandEyes immediately observed some requests returning HTTP 503 service unavailable messages, which coincided with a slight drop in the number of objects that loaded on the page. Around 15 minutes after ticket sales officially opened, ThousandEyes started to see some requests that appeared to have succeeded, rather than time out or return an error code. This pattern continued through the sale period, with an increase in page load and subsequent wait times, mixed with the occasional timeout.

Ticketmaster’s Approach

Ticketmaster appeared to be leveraging Fastly as its CDN. ThousandEyes observed good connectivity up to the CDN, but then saw intermittent timeouts and 503 “maximum connection” backend errors before Ticketmaster sold out its ticket allocation, indicating a slow or unresponsive backend. These 503 errors are typically generated server side and suggest that network connectivity and reachability were good; however, they indicate that the maximum number of concurrent connections to the origin server may have been reached.
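To show what a “maximum connections” condition can look like in code, here’s a simplified Python sketch that caps concurrent origin connections with a semaphore and fails fast with a 503 once the cap is exhausted. The cap value and the origin_fetch() helper are hypothetical; this is not Fastly’s or Ticketmaster’s actual implementation.

```python
import threading

MAX_ORIGIN_CONNECTIONS = 200  # illustrative cap, not a real Fastly setting
origin_slots = threading.BoundedSemaphore(MAX_ORIGIN_CONNECTIONS)

def origin_fetch(request: str) -> str:
    """Stand-in for a real call to the origin server (hypothetical helper)."""
    return f"response for {request}"

def handle_request(request: str):
    """Forward the request to the origin only if a connection slot is free."""
    if not origin_slots.acquire(blocking=False):
        # Every origin connection is in use: return 503 immediately rather
        # than queueing, mirroring a "maximum connections reached" error.
        return 503, "backend max connections reached"
    try:
        return 200, origin_fetch(request)
    finally:
        origin_slots.release()

print(handle_request("GET /tickets"))  # (200, 'response for GET /tickets')
```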

Figure 2. Request for content returned server-side “Backend.max_connection reached” error

Any Oasis fan who encountered the 503 error would likely have seen their transaction disrupted, as the front page was making calls and requests to backend services, where the errors were occurring.

Figure 3. ThousandEyes observed intermittent errors, timeouts, and successful requests when Oasis tickets were available for purchase on Ticketmaster

As the sales period progressed, requests were redirected to a waiting room until access could be granted.

Figure 4. Request directed to “waiting room” service until access could be granted

Explore Ticketmaster’s performance further through the ThousandEyes platform (no login required).

See Tickets’ Approach

Other authorized ticket sellers, such as See Tickets and Gigs and Tours, operate similar digital waiting rooms. With See Tickets in particular, ThousandEyes observed disruption in the form of service unavailable messages and HTTP 429 “Too Many Requests” errors, which suggest that rate limiting was being applied to the domain.


Dive deeper into See Tickets’ performance through the ThousandEyes platform (no login required).

Page load time and wait time also increased prior to the opening of sales, likely the result of people trying to get in the queue early. By 7:25 AM (UTC), 35 minutes before opening, ThousandEyes observed page load times continuing to increase and HTTP 500 internal server errors beginning to appear.

Figure 5. ThousandEyes observed increased page load time and timeouts on seetickets.com

Once tickets went on sale, ThousandEyes observed HTTP 429 errors, which typically indicate that too many requests are being made. These errors were likely the result of a rate-limiting process put in place to control the flow of traffic.
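As a rough illustration of that kind of rate limiting, the sketch below uses a token bucket that returns HTTP 429, along with a Retry-After hint, once a client exceeds its allowance. The capacity and refill rate are invented values; See Tickets has not published how its rate limiting works.

```python
import time

class TokenBucket:
    """Per-client token bucket: each request costs one token, and tokens
    refill at a steady rate. An empty bucket maps to an HTTP 429 response."""

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        # Refill tokens based on how much time has passed since the last call.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return 200, {}
        # Too many requests: tell the client how long to wait before retrying.
        retry_after = (1 - self.tokens) / self.refill_per_second
        return 429, {"Retry-After": str(round(retry_after, 1))}

# Invented limits: 5 requests up front, refilling at one token every 2 seconds.
bucket = TokenBucket(capacity=5, refill_per_second=0.5)
for i in range(7):
    status, headers = bucket.allow()
    print(i, status, headers)  # the last two requests return 429 with Retry-After
```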

Figure 6. ThousandEyes saw requests for content from seetickets.com, which returned HTTP 429 errors

Users were also presented with busy and error messages while trying to load the page, indicating that the site was still reachable and online, but just experiencing excessive load.

Figure 7. ThousandEyes observed busy and wait conditions when requesting content from seetickets.com

The traffic management techniques used to handle this temporary increase in website traffic demonstrate the compromises ITOps teams sometimes make as they try to ensure the best possible digital experience for as many users as possible during peak traffic periods. The placement of traffic management in the purchasing process likely affected the user experience in different ways. Some users may have experienced the biggest delays right at the beginning, while trying to access the purchasing process. Others who were able to proceed may have encountered delays at different stages of the purchasing flow, such as seat selection, data entry, or payment.

Microsoft Outage

At approximately 11:40 AM (UTC) on September 12, some users experienced issues when attempting to reach Microsoft services, such as Microsoft 365. ThousandEyes observed significant packet loss, as well as connection timeouts within Microsoft’s network during the incident. From ThousandEyes’ perspective, the problems were limited to a subset of users connecting to Microsoft’s network directly from AT&T or through AT&T peering infrastructure.

The incident was resolved by approximately 1:20 PM (UTC). The timeline of events that ThousandEyes observed closely aligns with Microsoft’s official status notifications, which indicated the issue lasted from 11:46 AM to 1:14 PM (UTC).


Explore the Microsoft outage further in the ThousandEyes platform (no login required).

Figure 8. Microsoft services, including Microsoft Online and Microsoft 365, experienced an outage on September 12

Figure 9. ThousandEyes observations show that the problems were limited to a subset of users connecting to Microsoft’s network via AT&T

Microsoft suggested that the root cause was an unspecified change made by AT&T that was then rolled back. AT&T later confirmed a “brief disruption connecting to some Microsoft services.”

During the incident, ThousandEyes observed high forwarding loss in the Microsoft network, which affected various Microsoft services. The incident appeared to have a mostly regional impact, affecting AT&T customers in the U.S. or users whose ISPs peered through AT&T. Interestingly, Azure-hosted services seemed to be unaffected, even those utilizing the affected AT&T-Microsoft peering point.

Figure 10. Traffic transiting through the AT&T-Microsoft peering point was a common factor for affected companies
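One simplified way to reach the kind of common-factor conclusion shown in Figure 10 is to intersect the AS paths seen from affected vantage points and check which networks they all share. The sketch below does this with hypothetical paths (the AS numbers themselves are real); ThousandEyes’ actual analysis comes from its path visualization data, not this code.

```python
# Hypothetical AS paths from several affected vantage points to Microsoft.
# The AS numbers are real (7018 = AT&T, 3356 = Lumen, 174 = Cogent,
# 8075 = Microsoft), but the paths themselves are invented for illustration.
affected_paths = [
    [7018, 8075],        # AT&T customer -> Microsoft
    [3356, 7018, 8075],  # ISP peering via AT&T -> Microsoft
    [174, 7018, 8075],   # another ISP peering via AT&T -> Microsoft
]

# Intersect the paths to find the ASes every affected path has in common.
common_ases = set(affected_paths[0])
for path in affected_paths[1:]:
    common_ases &= set(path)

# Besides the destination itself (8075), the only shared network is AT&T (7018),
# pointing at the AT&T-Microsoft peering as the common factor.
print(sorted(common_ases))  # [7018, 8075]
```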

Akamai Outage

On September 5, multiple broadband providers in the U.K. encountered issues reaching some Akamai services. The packet loss coincided with the appearance of Autonomous System (AS) 7473 (Singtel) in the path. The introduction of this AS into a route between European destinations appears to have redirected the traffic through Singapore, causing problems due to the unanticipated path, the additional load it carried, or both.

Figure 11. Packet loss coincided with the appearance of AS 7473 in the path

Explore the Akamai outage further in the ThousandEyes platform (no login required).

The cause of this incident appeared to be a BGP route leak, which briefly altered traffic patterns to the Akamai services. Looking at BGP routing during that time, ThousandEyes observed path changes due to what appeared to be a service provider erroneously inserting itself into the path. The Internet Engineering Task Force (IETF) defines a BGP route leak, in RFC 7908, as the unauthorized spread of routing announcements beyond their intended scope. In other words, it occurs when an AS mistakenly propagates a learned BGP route to another AS, violating the policies of the intended receiver, sender, and/or one of the ASes along the preceding AS path. This can lead to the redirection of traffic through unintended paths.
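As a toy illustration of how such a leak might be flagged, the check below compares an observed AS path against a baseline of expected transit ASes and reports anything unexpected. The baseline and paths are hypothetical (though the AS numbers are real, e.g., 7473 is Singtel and 20940 is Akamai), and real route-leak detection in the RFC 7908 sense requires routing policy context that this simple set difference doesn’t capture.

```python
# Hypothetical baseline of ASes normally seen between a U.K. broadband
# provider and Akamai: 2856 (BT), 1299 (Arelion), 20940 (Akamai).
EXPECTED_ASES = {2856, 1299, 20940}

def unexpected_ases(as_path):
    """Return any ASes in the observed path that aren't in the baseline."""
    return set(as_path) - EXPECTED_ASES

normal_path = [2856, 1299, 20940]
leaked_path = [2856, 7473, 20940]   # 7473 (Singtel) appears mid-path

print(unexpected_ases(normal_path))  # set() -> nothing to flag
print(unexpected_ases(leaked_path))  # {7473} -> flag for investigation
```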

Figure 12. ThousandEyes observed AS 7473 being inserted, withdrawn, reinserted, and then returning to the original path

Alibaba Cloud Outage

At 2:20 AM (UTC) on September 10, Alibaba Cloud detected problems in availability zone C of its Singapore region, which manifested as “abnormal” operations of cloud services. Not long after, the cause became clear: an “explosion of lithium batteries in the Singapore data center, which led to fire and elevated temperature.” Customers were urged to “migrate production workloads” to other functioning infrastructure “as soon as possible.”

The impacted facility is operated by Digital Realty, which said there were no injuries and little structural damage to the building, although it didn’t address the state of the hosted equipment, which can be adversely affected by heat, as well as by the gas suppression systems used to fight fires.

For Alibaba Cloud, restoration activities were still occurring a week later, with the cloud services provider noting that “some of the affected hardware and machineries are located in the dangerous and blocked area of the building where access is not allowed, and some hardware and machineries require to be carefully dried in order to ensure data security. The restoration of some long tail machines and inventories may take a longer time,” Alibaba added.

Cloudflare Incident

On September 16, starting at approximately 4:05 AM (UTC), ThousandEyes observed Cloudflare experiencing a performance incident that led to connectivity issues in multiple locations leveraging its CDN and networking services. The issue manifested as connection and service timeouts, indicating a problem with backend connectivity and routing within the Cloudflare environment. ThousandEyes observed impacts on applications such as Zoom and HubSpot. The incident lasted approximately two hours and was resolved by around 6:00 AM (UTC).


Explore the Cloudflare incident further in the ThousandEyes platform (no login required).

Figure 13. Cloudflare service disruptions impacted the reachability of Zoom, RingCentral, and other applications

By the Numbers

Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over two recent weeks (September 2-15):

  • The downward trend observed since mid-August continued into September. Over the period of September 2-15, the total number of global outages decreased. There was a 9% drop in the first week, with outages falling from 191 to 174. This trend continued into the following week, with outages decreasing slightly from 174 to 170 between September 9 and 15, a 2% decrease compared to the previous week.

  • The United States deviated from this pattern. Initially, outages increased by 13% during the first week of the period (September 2-8). However, in the following week (September 9-15), they decreased by 1%.

  • During the period of September 2 to 15, more than 50% of network outages occurred in the United States, compared to 45% in the previous two weeks (August 26 to September 1). This pattern has been fairly consistent throughout 2024, with U.S.-centric outages often accounting for at least 40% of all observed outages.

Figure 14. Global and U.S. network outage trends over eight recent weeks

