This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
People love a good detective story. There’s something so satisfying about watching an investigator gather clues and piece them together to figure out what actually happened.
IT operations teams often get to be the hero of their own detective story, piecing together various clues to determine what’s causing an outage. And it’s vital that they don’t overlook a clue or zero in on just one piece of evidence, missing the full picture.
Recent outages impacting Starlink and Google's Maps SDKs illustrate this point, showing how identifying the likely root cause of an outage requires examining multiple symptoms together to confirm a diagnosis.
Read on to learn more about what happened during these incidents and to get an update on the recent Red Sea cable cuts.
Starlink Outage
On September 15, beginning around 4:30 AM (UTC), Starlink's satellite Internet service experienced a brief global outage affecting users across multiple continents. The outage was notable not for the volume of user reports, but for its simultaneous global scope.
ThousandEyes monitoring observed service disruption across North America, Europe, Asia, and Australia at the same time—during the night for much of North America and late morning for parts of Asia and Europe. The incident lasted approximately 15 minutes, with service restoration completed by 4:45 AM (UTC) across most affected regions.
Understanding Satellite Internet Infrastructure
Starlink operates through two primary infrastructure layers: the space-based satellite constellation and ground-based gateway infrastructure. User terminals connect to overhead satellites, which then route traffic through the satellite mesh before ultimately reaching ground-based gateway stations that provide connectivity to the broader Internet.
Depending on where in this architecture an outage’s root cause lies, the disruption will display different symptoms. Satellite connectivity issues typically manifest as terminals being unable to establish basic connections to the constellation—users see "searching" behavior as terminals attempt to acquire satellite signals. Gateway infrastructure problems, however, can allow satellite connectivity to remain intact while Internet services like DNS resolution and traffic routing fail.
Understanding this distinction becomes critical when diagnosing satellite Internet outages.
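To make that distinction concrete, here is a minimal triage sketch in Python, assuming a handful of placeholder probe targets (a local router address, a public IP literal, and a hostname). It illustrates the layered checks described above; it is not ThousandEyes' or Starlink's diagnostic tooling.

```python
import socket

# Placeholder probe targets; real monitoring would use many vantage points
# and, ideally, the terminal's own status interface.
LOCAL_GATEWAY_IP = "192.168.1.1"     # assumed LAN address of the local router/terminal
EXTERNAL_IP = "1.1.1.1"              # public IP literal, reachable without DNS
EXTERNAL_HOSTNAME = "example.com"    # name that requires working DNS resolution

def tcp_reachable(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, 443))
    except socket.gaierror:
        return False

def classify() -> str:
    """Very rough triage of where along the satellite path a failure likely sits."""
    local_ok = tcp_reachable(LOCAL_GATEWAY_IP, port=80)
    external_ok = tcp_reachable(EXTERNAL_IP)
    dns_ok = dns_resolves(EXTERNAL_HOSTNAME)

    if external_ok and dns_ok:
        return "healthy: connectivity and Internet services both working"
    if external_ok and not dns_ok:
        return "DNS/ground-side services degraded while basic routing still works"
    if local_ok:
        return "upstream failure: could be the satellite link (terminal searching) or gateway infrastructure; check terminal status to separate the two"
    return "local network or terminal issue"

print(classify())
```

If the external IP check succeeds but DNS fails, the evidence points toward ground-side services rather than the satellite link, which is the pattern observed on September 15.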
What We Observed During the Starlink Outage
Looking at the September 15 Starlink outage, ThousandEyes observed multiple symptoms that helped narrow down the likely cause.
The outage’s simultaneous global nature provided an important clue. Unlike terrestrial Internet outages, which typically follow geographic or network topology patterns, this incident affected monitoring points across vastly different regions at exactly the same time. This suggested problems with core systems that coordinate service across Starlink's entire network.
Explore the Starlink outage further in the ThousandEyes platform (no login required).
Additionally, the disruption’s staggered recovery pattern and other key characteristics provided further insight into the root cause:
- DNS Resolution Failures: The primary issue manifested as widespread DNS failures across multiple monitoring locations globally.
- Limited Connection Issues: Far fewer agents reported basic connection problems compared to DNS resolution failures.
- Staggered Recovery: Service restoration occurred progressively rather than simultaneously across all regions.
This technical evidence suggested that Starlink terminals maintained their ability to connect to the satellite constellation. In other words, the fundamental satellite-to-terminal link remained operational. However, monitoring showed loss of ability to resolve Internet addresses and route traffic beyond Starlink's network, indicating a failure in the gateway infrastructure responsible for bridging Starlink's space-based network with terrestrial Internet services.
The rapid recovery timeframe of approximately 15 minutes also pointed to a gateway infrastructure issue—gateway-side fixes can typically be rolled out far more quickly than satellite constellation adjustments, for example.
Considerations for Enterprise NetOps Teams
This Starlink outage demonstrated how satellite Internet outages can create misleading symptoms. Monitoring showed that terminals could establish connections to the satellite constellation, but traffic failed to route beyond Starlink's network infrastructure.
For enterprise organizations using satellite connectivity, this incident highlights a few important considerations:
- Monitor beyond connection establishment: Basic connectivity tests may pass while actual Internet services fail. This failure mode—successful satellite connection with failed Internet routing—can complicate troubleshooting since initial diagnostics may appear normal (see the sketch after this list).
- Account for multiple failure modes in backup planning: Satellite Internet can fail at different infrastructure layers. Complete satellite connectivity loss requires different backup strategies than gateway infrastructure failures that maintain satellite links but block Internet access.
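As a rough sketch of what "monitor beyond connection establishment" can look like in practice, the loop below performs an end-to-end HTTP check and fails over on service health rather than link state. The check URL, thresholds, and failover hook are all placeholders to adapt to your own environment.

```python
import time
import urllib.request

# Hypothetical check URL and thresholds; adjust for your environment.
CHECK_URL = "https://www.example.com/"   # a service that must traverse the satellite path
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_SECONDS = 30

def internet_service_ok(url: str = CHECK_URL, timeout: float = 5.0) -> bool:
    """End-to-end check: DNS resolution, TLS, and an HTTP response must all work."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

def switch_to_backup_wan() -> None:
    """Placeholder: trigger an SD-WAN policy change, route metric update, etc."""
    print("Failing over to backup WAN path")

def monitor_loop() -> None:
    consecutive_failures = 0
    while True:
        if internet_service_ok():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # The satellite link may still report "connected" at this point;
            # the decision is driven by service health, not link state alone.
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
                switch_to_backup_wan()
                return
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    monitor_loop()
```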
Google Maps SDK Outage
On September 11, Google reported issues with their Maps SDK services affecting mobile applications on both Android and iOS platforms. Starting at 6:12 PM (UTC), the Google Maps outage lasted approximately four hours and 15 minutes, with full resolution declared at 10:27 PM (UTC). During the disruption, mobile apps integrating Google's Maps SDK displayed blank screens and "Cannot reach server" error messages, while people accessing Google Maps through web browsers experienced no issues whatsoever.
What We Observed During the Google Maps SDK Outage
To understand this outage, it's important to grasp how mobile map rendering works. Both web-based Google Maps and mobile SDK implementations use a tile-based system to display maps. Think of map tiles like puzzle pieces—the map you see is composed of many small square images (typically 256x256 pixels) that fit together seamlessly. When you zoom in or pan around a map, your device requests different tile combinations to show the appropriate level of detail for your current view.
The critical difference lies in how these tiles are requested and processed. Web browsers make direct HTTP requests to Google's tile servers and handle the rendering themselves. Mobile SDKs, however, use a more complex system where Google's software embedded in mobile apps manages the tile retrieval and rendering process. This SDK acts as an intermediary layer between the mobile app and Google's mapping infrastructure.
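For readers who want the tile model made concrete, the sketch below shows the standard Web Mercator tile arithmetic that slippy-map clients generally use. The exact scheme inside Google's SDK isn't documented here, so treat this as a generic illustration rather than Google's implementation.

```python
import math

TILE_SIZE = 256  # each tile is a 256x256 pixel image

def lat_lon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple:
    """Convert a latitude/longitude to Web Mercator tile x/y indices at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tiles_for_viewport(lat: float, lon: float, zoom: int,
                       width_px: int, height_px: int) -> list:
    """List the tile indices needed to cover a viewport centered on (lat, lon)."""
    center_x, center_y = lat_lon_to_tile(lat, lon, zoom)
    tiles_wide = math.ceil(width_px / TILE_SIZE) + 1
    tiles_high = math.ceil(height_px / TILE_SIZE) + 1
    return [(center_x + dx, center_y + dy)
            for dx in range(-(tiles_wide // 2), tiles_wide // 2 + 1)
            for dy in range(-(tiles_high // 2), tiles_high // 2 + 1)]

# Example: tiles needed to draw a 1080x720 view of central London at zoom 12.
print(tiles_for_viewport(51.5074, -0.1278, 12, 1080, 720))
```

Whether those tile requests are made directly by a browser or mediated by an embedded SDK is exactly where the September 11 failure mode diverged.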
During the outage window, ThousandEyes data showed all monitored Google services as healthy and reachable. Targets like *.googleapis.com and standard web-accessible endpoints continued responding successfully: DNS resolution worked properly, SSL handshakes completed without issues, and basic connectivity tests passed.
From an external monitoring perspective, Google's core mapping infrastructure appeared fully operational and remained healthy throughout the incident, while Google's official status page confirmed that the Maps SDK for iOS and Android and the Navigation SDK were experiencing failures. This combination suggested that the issue resided in SDK-specific infrastructure rather than core mapping systems, with the problem occurring during SDK initialization—the process where mobile apps establish their connection to Google's mapping services.
Lessons for ITOps Teams
This incident illustrates the value of correlating multiple information sources when diagnosing complex issues. External monitoring showed Google's infrastructure as healthy, while user reports indicated service problems. By combining monitoring data with Google's official status updates, the likely root cause became clear—specialized backend systems supporting mobile implementations had failed while public-facing infrastructure remained operational.
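That correlation logic can be captured in a few lines. The sketch below is a hypothetical triage helper (the signal names are ours, not a ThousandEyes or Google API) that combines the three coarse signals discussed above to suggest where a fault likely resides.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    external_checks_pass: bool       # DNS, TLS, and HTTP to public endpoints all succeed
    provider_reports_incident: bool  # provider status page lists an active incident
    users_report_errors: bool        # e.g., "Cannot reach server" in mobile apps

def likely_fault_domain(s: Signals) -> str:
    """Rough triage of where an issue probably resides, given three coarse signals."""
    if s.users_report_errors and s.external_checks_pass:
        if s.provider_reports_incident:
            # The September 11 Maps SDK pattern: public endpoints healthy,
            # provider confirms an SDK incident -> SDK-specific backend systems.
            return "provider's SDK/backend layer, not core public infrastructure"
        return "client-side integration or a backend not covered by external checks"
    if s.users_report_errors and not s.external_checks_pass:
        return "core infrastructure or the network path"
    return "no clear fault; keep correlating signals"

print(likely_fault_domain(Signals(True, True, True)))
```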
For ITOps teams, this highlights the importance of gathering evidence from multiple sources when external monitoring and user reports don't align. The challenge isn't that monitoring failed to detect the issue, but rather that the initial symptoms required careful analysis to identify where in the service architecture the problem resided.
Organizations should consider how they would identify and escalate issues when the evidence initially appears contradictory. Standard support channels may not immediately recognize problems if the provider's primary status page shows healthy status, making independent monitoring—coupled with direct communication paths—valuable for getting accurate incident details.
Update: Red Sea Cable Cuts
In our last blog post, we covered the September 6 Red Sea submarine cable damage that affected the SMW4, IMEWE, FALCON GCX, and Europe India Gateway systems, which carry Internet traffic between Europe, the Middle East, and Asia. While the physical repairs could take weeks, ThousandEyes observed a significant improvement in network performance about eight days after the initial incident.
To maintain connectivity, providers had initially rerouted traffic through different paths, with varying results. Connectivity was preserved, but some areas experienced increased latency and/or packet loss. However, starting around September 14 at 8:08 PM (UTC), monitoring of specific routes like AWS Mumbai to Frankfurt showed a sudden return to near pre-damage performance levels.
This suggested that network operators had refined their initial rerouting strategies, implementing a solution that delivered performance equivalent to the original Red Sea transit paths.
Analysis of the routing changes indicated that this improvement likely resulted from two primary factors (a brief sketch of how such a path change can surface in measurement data follows the list):
- Alternative submarine cable systems: Rather than continuing to route through suboptimal backup paths, operators appeared to have activated capacity on completely different submarine cable systems. These alternative systems likely route through different geographic corridors (potentially via the Gulf), through alternative Red Sea cables that weren't damaged, or through entirely different continental routing paths.
- Reconfigured peering arrangements: The eight-day timeframe required for this improvement suggested substantial behind-the-scenes negotiations and configuration changes, rather than a simple route optimization. Network operators likely established new peering relationships and capacity agreements that provide direct access to alternative submarine cable infrastructure, bypassing the need for the damaged Red Sea systems entirely.
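Assuming you collect forwarding paths for key routes (from traceroute data or a path visualization tool), one simple way to spot this kind of reroute is to compare the AS-level path before and after a change. The sketch below uses made-up, documentation-range AS numbers purely for illustration; it is not the actual path data ThousandEyes observed.

```python
# Hypothetical AS-level paths for the same source/destination pair, captured
# before and after September 14. Documentation-range AS numbers, illustration only.
path_before = [64496, 64497, 64498, 64499, 64510]   # via the damaged Red Sea systems
path_after  = [64496, 64497, 64502, 64503, 64510]   # via an alternative corridor

def path_changed(before: list, after: list) -> bool:
    """Flag any change in the AS-level path between two measurements."""
    return before != after

def diverging_hops(before: list, after: list) -> list:
    """Pair hops positionally and return the ones that differ."""
    length = max(len(before), len(after))
    padded_before = before + [None] * (length - len(before))
    padded_after = after + [None] * (length - len(after))
    return [(b, a) for b, a in zip(padded_before, padded_after) if b != a]

if path_changed(path_before, path_after):
    print("Routing change detected; differing hops:", diverging_hops(path_before, path_after))
```

Paired with latency baselines, a diff like this helps confirm when traffic has actually shifted onto new infrastructure rather than merely recovering on the old path.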
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed over recent weeks (September 8-21) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.
Global Outages
- From September 8 to 14, ThousandEyes observed 301 global outages, representing a slight 2% decrease from 308 the prior week (September 1-7)—which was the highest weekly outage total seen in almost a month.
- During the week of September 15 to 21, global outages remained essentially stable at 302, showing minimal change from the previous week. This consistency suggests global outage levels have plateaued near the elevated levels first seen in early August (just over 300 outages per week). These levels are well above the 187 outages we saw in late July and the very beginning of August (July 28 - August 3), indicating a sustained season of higher network disruption.
United States Outages
- The United States experienced a notable spike during September 8-14, with outages climbing to 184—representing an 11% increase from the previous week's 166 and marking the highest weekly total observed in the entire tracking period (July 28 - September 21).
- This peak was followed by a significant correction during September 15-21, when U.S. outages dropped to 161, representing a 12% decrease from the previous week's high.
- Despite this decline, the 161 outages recorded during September 15-21 still represented the second-highest weekly total for U.S. network disruptions in the tracking period, indicating continued elevated activity levels.
- Over the two-week period from September 8-21, the United States accounted for 57% of all observed network outages, representing a majority of global network disruptions during this timeframe.