This past week, a powerful volcanic eruption disconnected the Kingdom of Tonga from the rest of the world as a crucial telecommunications cable, carrying voice and data traffic, was severed. Some voice communications were re-established through geostationary satellites, but it is expected to take about a month to fix the subsea cable.
The cable damage has highlighted two things:
- The vulnerability of Tonga (and nations in similar circumstances), which has a single “pipe” in and out and is, consequently, one step away from total isolation
- More broadly, inequity in access to Internet infrastructure, despite a world that is increasingly digitally driven
As we move more to the metaverse and the world becomes more digitized, it’s becoming clear that many countries and locations rely on fragile Internet connectivity. However, it could also be the hyperscalers, those with the keenest interest in driving the growth and reach of the metaverse, that start to bridge some of this divide. As we saw this week, that concentration of Internet infrastructure among a handful of prominent players is already happening.
Taking a broader view of Internet health this week, we saw more of the same when it comes to outages. That is, while Internet outage numbers remain higher than a year ago, they have leveled off, particularly through December and now into January, which matches the patterns we’ve seen in both the early part of 2020 and 2021.
The upward trend in global outages, which we’ve observed since the turn of the year, continued into its third week with an (albeit small) 3% increase over the previous week. Domestically, however, we saw a reversal of this trend as the number of observed U.S. outages dropped from 86 to 81, a 6% decrease from the prior week. This decrease in U.S. outages, combined with the rise in global outages, means that U.S. outages accounted for 33% of all global outages last week, which is 3 percentage points lower than the previous week and 15% lower than the same period in 2021.
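For readers who want to check the arithmetic, here is a rough sketch of how those figures relate. The global totals are not stated in this post, so the code back-calculates them from the 33% share and the 3% week-over-week increase; treat those derived values as assumptions rather than reported numbers.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Reported U.S. outage counts for the previous and current week.
us_prev, us_now = 86, 81
us_change = pct_change(us_prev, us_now)  # roughly -5.8, i.e. about a 6% decrease

# Assumption: global totals are inferred, not reported in the post.
share_now = 33.0                           # reported U.S. share of global outages, in percent
global_now = us_now / (share_now / 100)    # implied global total this week
global_prev = global_now / 1.03            # undo the reported 3% global increase

# U.S. share of the implied global total in the previous week.
share_prev = us_prev / global_prev * 100

# Difference between the two shares, in percentage points.
point_drop = share_prev - share_now

print(f"U.S. change: {us_change:.1f}%")
print(f"U.S. share: {share_prev:.1f}% -> {share_now:.1f}% ({point_drop:.1f} points lower)")
```

Under these assumptions, the previous week's U.S. share works out to about 36%, so the week-over-week drop is about 3 percentage points rather than a 3% relative change.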
Interestingly, we have not observed any major outages in the latter part of 2021 and early 2022. Instead, we continue to see outages and performance degradation that appear to be consistent with routine maintenance and engineering tasks. The “blast radius” of these smaller events is often larger than a single outage due to the complex interdependencies between different providers and services. In today’s environment, one provider’s maintenance window can have a flow-on effect downstream.
That said, one of the trends worth discussing is the shift in when this work occurs.
Historically, on a global basis, outages peaked on Fridays. However, through 2021, the numbers started to level out towards the middle of the week. And then, by the end of calendar year 2021, the bulk of the outages we saw occurred on Monday night and in the early hours of Tuesday morning.
When we break the numbers down on a regional basis, it’s clear that this trend isn’t being driven by North America, where the bulk of outages occur on Thursday night or Friday morning, and no change in this trend has been observed.
If we compare January 2021 to January 2022, there’s a “changing of the guard” in terms of the region responsible for the most outages. Where, historically, North American ISPs were responsible for the majority of outages seen globally, we’ve seen an increase in outages from ISPs in EMEA. ISPs in this region traditionally perform work and push updates out on Monday night and Tuesday morning; the rise in the amount of work they performed through 2021 means they have a greater influence on global outage statistics, both in total and in terms of when in the week those outages occur.
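To illustrate the kind of regional breakdown described above, here is a minimal sketch that groups outage events by region and finds the weekday on which each region sees the most outages. The event list is made-up sample data, and the function name and the assumption that timestamps are already in each region's local time are hypothetical; this is not the actual dataset or tooling behind these numbers.

```python
from collections import Counter, defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def peak_weekday_by_region(events):
    """Return {region: weekday name with the most outages}.

    `events` is an iterable of (region, local_start_time) pairs.
    Assumption: timestamps are already in the region's local time.
    """
    counts = defaultdict(Counter)
    for region, started_at in events:
        counts[region][WEEKDAYS[started_at.weekday()]] += 1
    return {region: c.most_common(1)[0][0] for region, c in counts.items()}

# Hypothetical sample events, for illustration only.
sample_events = [
    ("EMEA", datetime(2022, 1, 18, 1, 0)),            # Tuesday
    ("EMEA", datetime(2022, 1, 18, 2, 30)),           # Tuesday
    ("EMEA", datetime(2022, 1, 21, 3, 0)),            # Friday
    ("North America", datetime(2022, 1, 21, 4, 0)),   # Friday
    ("North America", datetime(2022, 1, 21, 5, 0)),   # Friday
    ("North America", datetime(2022, 1, 18, 2, 0)),   # Tuesday
]

peaks = peak_weekday_by_region(sample_events)
print(peaks)
```

As more of the total comes from a region with an early-week peak, the global peak shifts towards that day even if no individual region changes its habits.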
So why would different regions each tend to concentrate maintenance and engineering work in their own consistent time window each week?
I have two hypotheses.
First, ISPs in EMEA may push updates earlier in the week so that if there’s any flow-on impact or fallout, they still have time left in the regular working week to restore or rectify services, reducing the likelihood of having to perform work over the weekend. As to why North American ISPs do not follow suit, perhaps they are more confident in the insights they have at hand before they push an update, and know with greater certainty that the impact of a change will be minimal.
Second, we may simply be looking at two different cycle times in terms of when updates are coded and when they are ready for live deployment. An early-in-the-week release schedule might point to coding work being performed the previous week and perhaps the use of more “waterfall”-influenced approaches to software deployment. In contrast, a late-in-the-week release might indicate the use of more agile coding methodologies and the treatment of a week or fortnight as a sprint.
I welcome comments or insights on Twitter from engineers in either region as to whether these hypotheses stand. We’ll also continue to monitor the day of the week trends throughout 2022 to track any departure from the trends we’ve observed to date.