This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re also featuring a conversation exploring the world of subsea cables, with special guest Murray Burling, Executive Director of Oceans and Environment at RPS. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.
Internet Outages & Trends
There’s a line in the ThousandEyes Cloud Performance Report that I often return to, one that ponders the extent to which enterprise data is “in the cloud or under the sea.” The answer, of course, may be both: It’s in the cloud at rest and under the sea in transit.
Whether in the cloud or under the sea, that infrastructure needs to be understood by everyone responsible for the data and for how it makes its way between the components of the end-to-end service delivery chain that underpins digital experience. An incomplete understanding can translate into customer-facing problems, manifesting as degraded performance or an inability to access a digital service and/or transact with it.
The growing complexity of subsea and terrestrial cable infrastructure that underpins both enterprises’ and the world’s traffic flows presents a challenge for IT teams, who are trying to understand all the interactions, relationships, and dependencies that can impact their services and applications.
Read on for more about this and recent disruptions, or use the links below to jump to the sections that most interest you:
Subsea Cable Resiliency
Subsea cables are the backbone of international data movement today, with the Internet heavily reliant on undersea cables for high-capacity communications. A widely cited piece of research suggests that 97% of the world’s Internet traffic is carried via subsea routes. Subsea cables increasingly play a role in domestic traffic management as well, with subsea routes used in place of terrestrial fiber on intercapital and interstate paths.
Whether terrestrial or subsea, fiber cabling is susceptible to damage, and that damage can lead to disruption or degraded performance on key routes. On terrestrial routes, breakages are often accidental, caused by heavy earthmoving equipment whose operators are unaware of cable paths before excavation begins. An equivalent hazard exists for undersea cables. At either end, the cables run through relatively shallow waters to a cable landing station, from which traffic is pushed onto terrestrial links. While the coastal zones around landing stations tend to be protected, these shallow approaches mean the cables may still encounter shipping and other marine traffic. Heavy anchors, propellers, and cables do not mix, and stormy weather only increases the likelihood of accidental breakage. Even where cables run along the seabed in deeper water, there are other hazards to consider: Seismic activity and even sharks have been considered problematic in the past.
When damage does occur, locating a cut and re-splicing the fiber is considerably easier on terrestrial routes than on subsea ones. Repairing a damaged subsea cable can take some time, depending on whether a repair vessel is available, already engaged elsewhere, or even present in the area, and additional time is needed to locate and fix the fault itself.
All that being said, submarine cables are increasingly designed for resilience, with diverse paths and multiple cable systems to ensure continued functionality. Increasing the number and diversity of cable routes offers some mitigation, and establishing a high level of redundancy enhances resilience, helping to ensure that any single instance of damage is unlikely to have a significant impact on a user’s digital experience.
Both geostationary and low earth orbit (LEO) satellites can also provide additional options for business continuity, helping to supplement cable capacity and fill gaps. Although they cannot entirely compensate for widespread cable loss, they can help ensure that crucial messages still get through and offer a feasible way to match application characteristics to infrastructure constraints, thereby reducing the load on subsea infrastructure.
The vastness of today’s subsea cable ecosystem drives home the importance of understanding your end-to-end service delivery chain, and the routes that traffic takes or may take as part of your digital experience delivery.
In our Cloud Performance Report, ThousandEyes analyzed network performance data to help organizations determine the optimal cloud ecosystems and zones/regions for their needs, based on ingress/egress points, how traffic reached those points (sometimes via subsea routes), and whether those routes were operationally acceptable.
But more than that, with so much traffic, domestic and international, traversing subsea paths, it’s critical to understand how and where those cables are landed, and what onward paths exist for that traffic, in order to really drive optimization and improvement in digital experience delivery. Careful attention and visibility here may reduce the risk of a capacity reduction caused by accidental cable breakage. It may also provide an evidentiary basis for aligning with providers that have prioritized capacity between the cable landing station and their nearest point of presence, upgrade work that has been occurring in Virginia recently.
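For teams just starting to build that picture, even a simple forward-path check can be informative. The sketch below is a minimal, illustrative example that shells out to the system traceroute (available on most Linux and macOS hosts); the destination is a placeholder, and real path visibility ultimately requires many vantage points and repeated measurements rather than a one-off trace.

```python
# Minimal sketch: inspecting the forward path to a destination with the
# system traceroute. The destination is a placeholder; intermediate hops
# on real networks may not respond.
import subprocess

destination = "example.com"  # placeholder target

# -n skips reverse DNS lookups so the hop IPs are easier to scan quickly.
result = subprocess.run(
    ["traceroute", "-n", destination],
    capture_output=True,
    text=True,
    timeout=120,
)
print(result.stdout)
```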
Listen to the podcast for more subsea cable insights from Mike and special guest Murray Burling.
AWS Disruption
On August 29, AWS experienced a global issue between 8:32 AM and 9:58 AM (UTC) that impacted access to multiple cloud services. The problems manifested as an inability to contact AWS service endpoints in us-east-1 from seven other regions, among them us-gov-west-1. “Customers may also have received error messages when contacting the global STS (Security Token Service) endpoint from the affected regions,” AWS said.
Because the problems surfaced as accessibility issues, the company initially suspected a problem with its identity and access management service. Upon investigation, however, AWS engineers found the root cause to be an unspecified “networking issue,” which they remediated, allowing access to recover.
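As a general resilience practice, some teams reduce their dependence on the global STS endpoint by calling Regional STS endpoints instead. The sketch below is a minimal, illustrative example using boto3; the region is a placeholder, and this is offered as a common hardening option, not as AWS’s remediation for this incident.

```python
# Minimal sketch: calling a Regional STS endpoint instead of the global one.
# Region and endpoint are illustrative; adapt to the regions you operate in.
import boto3

sts = boto3.client(
    "sts",
    region_name="us-west-2",
    endpoint_url="https://sts.us-west-2.amazonaws.com",
)

# Simple health check: confirm the caller identity resolves via this endpoint.
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])
```

Setting the AWS_STS_REGIONAL_ENDPOINTS environment variable to “regional” (or the equivalent shared-config option) achieves a similar result for SDKs that support it.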
In my experience, misidentifying the root cause of a degradation or outage happens fairly often, even more so when customers are unable to authenticate to or access services. Users, and even engineers, can too quickly draw their own conclusions about the root cause. However, as we have observed, some outages are not what they seem, and having independent visibility of every component of the service delivery chain can help teams accurately identify the root cause, and recognize “false flags” for what they are.
ServiceNow Outage
On August 26, ServiceNow experienced an outage that affected some customers, preventing them from accessing their ServiceNow instances. The outage was first observed around 7:25 PM (UTC) and initially appeared to be originating from ServiceNow’s environment, with an increase in network connection and server timeouts observed across ServiceNow servers.
Explore the ServiceNow outage further in the ThousandEyes platform (no login required).
During the incident, the ServiceNow network experienced higher-than-normal loss rates across multiple ISPs, suggesting a surge in traffic that was overwhelming the network circuits, a classic sign of a potential DDoS attack. ServiceNow later confirmed that DDoS activity caused a surge in network traffic that saturated externally facing network circuits in its Arizona and Ontario data centers, causing the affected instances to become inaccessible to most users.
Around 8:30 PM (UTC), loss rates began to decrease and connectivity appeared to return for the majority of customers. Residual impact was observed until around 9:40 PM (UTC), when full connectivity was restored.
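To illustrate the kind of signal involved, elevated loss observed simultaneously from multiple independent networks is a useful heuristic: it points toward the target’s edge rather than any single ISP. The sketch below is a toy example with hypothetical vantage-point names, loss values, and threshold; it is not how ThousandEyes measures or aggregates loss.

```python
# Toy sketch: flagging elevated packet loss across multiple vantage points.
# All names, values, and the threshold are hypothetical.

LOSS_THRESHOLD_PCT = 5.0  # illustrative alerting threshold

# Hypothetical per-vantage-point loss toward a target, in percent.
loss_by_vantage = {
    "isp-a-us-west": 42.0,
    "isp-b-us-east": 37.5,
    "isp-c-eu-west": 1.2,
}

elevated = {vp: loss for vp, loss in loss_by_vantage.items()
            if loss > LOSS_THRESHOLD_PCT}

# Loss seen from several independent networks at once suggests saturation
# at the service's externally facing circuits rather than an ISP issue.
if len(elevated) >= 2:
    print(f"Elevated loss from {len(elevated)} vantage points: {elevated}")
```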
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (August 19 - September 1):
- Over the period of August 19 to September 1, the total number of global outages decreased. There was a 3% drop in the first week, with outages falling from 211 to 204. This trend continued into the following week, with outages decreasing from 204 to 191 between August 26 and September 1, a 6% decrease compared to the previous week.
- The United States did not follow this pattern. Outages increased initially during the first week of the period (August 19-25), rising by 27%. However, in the following week (August 26 to September 1), they decreased, dropping by 23%.
- Despite the notable 23% decrease in U.S. outages during that second week, U.S.-centric outages still made up more than 40% of all global outages during the fortnight. Between August 19 and September 1, 45% of network outages took place in the United States, compared to 44% in the previous two-week period (August 5-18). This trend has been consistent throughout most of 2024, with U.S.-centric outages often making up at least 40% of all observed outages.
- In August, 888 outages were observed worldwide, a 9% increase from the 816 reported in July. In the U.S., outages also increased, rising 16% from 334 in July to 387 in August. This differs from previous years, when total global outages typically increased between July and August, but U.S.-specific outages decreased in August before rising again in September.