This week on the Internet, we saw some sizable incidents that were either attributed to—or blamed on—the domain name system (DNS). All, in some way, manifested in the way DNS issues usually do: with time-out errors as users attempted to access websites.
It’s understandable why everyone’s first thought would be DNS. It’s the reason why the meme-phrase “It’s always DNS” generates almost 14,000 search engine results monthly.
Except, it isn’t always DNS anymore.
Take our weekly Internet Report series as evidence. The outages or performance blips we report on are often related to cloud, network or some other obscure but central dependency or interdependency that represents a single point of failure. And while it’s true that DNS outages often lead to website or service outages, not every website or service outage traces back to DNS.
And so, this week, we examine three incidents that, on the surface, appear to be DNS-specific: an outage at a consumer electronics company (which technically experienced two issues in two days), an issue with Australia’s .au domain and an issue with a global cloud service provider’s load balancer service.
In all three cases, DNS was either suspected to be the cause, attributed as the cause or confirmed to be the cause. We’ll explore those nuances, but it might be helpful to preface this with an explanation from my colleague, Angelique Medina, of why DNS is often the culprit:
“DNS is important because that's effectively the first step to reaching a site. DNS is basically the translation between a human-readable name and an IP address. And so anytime that you want to reach a site like Amazon or Twitter, you need to do a lookup, and you need to request a DNS record. And then, when you get a response back, you can then connect to the IP address that you've received in the response… It's such a foundational system, basically mapping users to their destinations.”
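The two steps described in that quote, a lookup followed by a connection, can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than how any particular browser or monitoring agent works; the function names `resolve` and `connect` are our own.

```python
import socket


def resolve(hostname, port=443):
    """Step 1: translate a human-readable name into IP addresses."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    addresses = []
    for info in infos:
        ip = info[4][0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def connect(ip, port=443, timeout=3.0):
    """Step 2: open a TCP connection to an address returned by the lookup."""
    return socket.create_connection((ip, port), timeout=timeout)
```

If `resolve` raises `socket.gaierror`, the request never gets as far as `connect`, which is why a DNS failure tends to look like the whole site being down.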
This past week, a major consumer electronics company experienced two incidents on two consecutive days. The first, on March 21, impacted a large number of consumer-facing services as well as the company’s internal systems. The second, 24 hours later, had a smaller blast radius but was still felt by users. All we know is that the problem was reportedly DNS-related and has since been resolved. ThousandEyes confirmed a global impact at the application level, even though the network was up and everything was technically reachable.
Then, on March 22, Akamai, a major CDN and cloud services provider, experienced an issue with its global load balancer service, Global Traffic Management. The incident rendered several major websites unavailable, which naturally led to initial suspicions that the culprit was DNS (a cause the company would later specifically rule out in an advisory).
Domain name resolution was working, but user traffic then couldn’t be delivered to its destination. We saw high traffic loss rates on paths to and from the affected websites.
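The distinction between "the name didn't resolve" and "the name resolved but traffic couldn't be delivered" is easy to check for yourself. Below is a rough Python sketch under our own assumptions: the function name `classify` and its three labels are hypothetical, and it tests only a single TCP connection, which is far cruder than path-level loss measurement.

```python
import socket


def classify(hostname, port=443, timeout=3.0,
             resolver=socket.getaddrinfo,
             connector=socket.create_connection):
    """Return "dns", "delivery" or "ok" depending on which stage fails."""
    try:
        infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The lookup itself failed, so this one really is DNS.
        return "dns"
    for info in infos:
        try:
            # sockaddr is (host, port) for IPv4 and a 4-tuple for IPv6.
            conn = connector(info[4][:2], timeout=timeout)
        except OSError:
            # Covers timeouts, refusals and unreachable networks.
            continue
        conn.close()
        return "ok"
    # Every resolved address refused or dropped the connection.
    return "delivery"
```

The `resolver` and `connector` parameters exist only so the logic can be exercised without touching the network; in normal use the defaults are fine.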
Finally, we saw an incident with the .au domain that reportedly caused issues for about 15,000 websites, and impacted the ability of users within Australia to access them, over a period of more than two hours. As with the other incidents, DNS was suspected, but this is the only incident where the suspicions were officially confirmed.
The .au administrator, auDA, attributed the issue to “a bug in the process that generates the DNSSEC digital signing records.” It also said that “end users using one public DNS resolver were impacted.”
This issue was particularly painful because it involved not just the authoritative DNS server but also the public DNS resolver, so changing resolvers, the workaround usually suggested to end users for DNS issues, did not let impacted users sidestep the problem.
In our own monitoring, we could see name resolution failing for a number of Australian companies’ websites; however, the impact was intermittent, and not all Australian Internet users reported problems.
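One way to see this kind of failure from the outside is to send the same question to a specific resolver and look at the response code. The Python sketch below hand-builds a minimal DNS query for an A record using only the standard library. The helper names (`build_query`, `rcode_of`, `ask`) are ours, it ignores retries and TCP fallback, and it says nothing about what the affected resolver actually returned during the incident. As general background, a validating resolver answers with SERVFAIL when DNSSEC validation fails.

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, query_id=0x1234):
    """Build a DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def rcode_of(response):
    """Extract the response code from the low 4 bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(resolver_ip, name, timeout=2.0):
    """Send the query to one resolver over UDP and return its response code."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ip, 53))
        data, _ = sock.recvfrom(512)
    return rcode_of(data)
```

Calling `ask` for the same name against two or three different resolver addresses of your choosing shows whether a failure is specific to one resolver or shared by all of them.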
That brings us to this week's outage numbers. Globally, last week's upward trend continued as we observed total outages increase from 216 to 234, an 8% increase compared to the previous week. This increase was also reflected domestically, where outage numbers rose from 82 to 99, a 21% increase compared to the prior week. As a proportion of total global outages, U.S. outages also increased this week to 42%, which is 4 percentage points higher than the previous week's 38%. This increase means that the levels continue to trend toward the average percentages we saw across 2021.
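For anyone who wants to recheck the week-over-week figures from the raw counts, the arithmetic is short. A minimal Python sketch; the helper names `pct_change` and `share` are ours and purely illustrative.

```python
def pct_change(old, new):
    """Week-over-week change, as a percentage of the earlier value."""
    return (new - old) / old * 100


def share(part, total):
    """One count as a percentage of another."""
    return part / total * 100


# Outage counts quoted in the paragraph above.
global_prev, global_now = 216, 234
us_prev, us_now = 82, 99

print("Global change (%):", round(pct_change(global_prev, global_now)))
print("U.S. change (%):", round(pct_change(us_prev, us_now)))
print("U.S. share last week (%):", round(share(us_prev, global_prev)))
print("U.S. share this week (%):", round(share(us_now, global_now)))
```

Note that the change is divided by the earlier week's count, not the later one, and that the difference between two shares is measured in percentage points.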
And so, that was the week on the Internet: one confirmed DNS issue; one reported DNS issue; and one suspected DNS issue that turned out not to be. That shows two things: 1) that in 2022, it’s no longer always DNS; and 2) that without holistic visibility of your infrastructure, and of the end-to-end digital experience, it is still too easy to misinterpret a problem and waste time looking in all the wrong places.