Akamai DNS Outage Analysis

By Angelique Medina
| July 22, 2021 | 10 min read

Summary

Learn how the July 22nd Akamai DNS outage unfolded, why services experienced the same outage differently, and three lessons you can take away from this incident.


On July 22, at approximately 8:38 AM PT, Akamai’s DNS, a critical service that directs users to its CDN edge, suffered an outage that prevented users around the globe from reaching its customers’ sites. Users attempting to browse sites hosted by Akamai received error messages indicating that the requested domain name could not be resolved to an IP address. The issue was resolved and service restored approximately one hour later, at 9:45 AM PT. Akamai has also published its own summary of the service disruption.

Resolving domain names to IP addresses via the Domain Name System is a critical first step in reaching a web property. Although the outage was brief, its scope was significant, with a large number of websites and applications (ranging from gaming to major banks, airlines and more) completely or intermittently unreachable. Network connectivity to Akamai’s CDN edge infrastructure remained available during the outage, but without Edge DNS's authoritative nameservers to resolve domain names, the websites and applications became unreachable to users.

ThousandEyes observed a spike in outages during the incident, all affecting websites and applications hosted on Akamai’s servers.

Figure 1. Application and web outages increased during the incident.

ThousandEyes further observed that Akamai’s DNS service was unable to resolve the domains hosted in Akamai's CDN.

Figure 2. HTTP connection to site in Akamai's CDN fails during DNS resolution phase

Role of DNS in CDN Traffic Management

The Domain Name System (DNS) maps human-readable domain names, such as “example.com,” to IP addresses. CDN providers commonly use DNS to load balance traffic across their infrastructure and direct users to the optimal edge server based on geographical proximity, server availability or performance, and other factors. To use DNS in this way, the service provider must host DNS records for the sites in its CDN. Enterprises will typically configure a domain name such as “www” as a CNAME record, which may point to additional CNAMEs and ultimately ends at an A record that provides the IP address. This layered approach lets the service provider control which IP address in the CDN a client receives, so as to optimize the client experience, and provides flexibility to change IP addresses when needed.

For example, ThousandEyes uses Akamai’s CDN service to host our website, “www.thousandeyes.com.” This domain name resolves to a CNAME in Akamai DNS's "edgekey.net" zone. That CNAME in turn resolves to another Akamai CNAME in the "akamaiedge.net" zone, and that name then resolves to an A record with the IP address needed to reach the CDN edge (see Figure 3).

Figure 3. DNS query for www.thousandeyes.com

If the CDN’s DNS service were to become unavailable, then the CDN edge would effectively be unreachable as well. This is what happened during the Akamai DNS service disruption, although the impact of the outage varied across its customers and users based on multiple factors.
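The CNAME chain described above, and what happens when one zone in it stops answering, can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the record names, the `192.0.2.10` documentation address, and the `resolve` helper are hypothetical placeholders, not Akamai's actual records or any resolver's real implementation.

```python
# Hypothetical zone data: names and the 192.0.2.10 address are placeholders.
RECORDS = {
    "www.example.com": ("CNAME", "www.example.com.edgekey.net"),
    "www.example.com.edgekey.net": ("CNAME", "e0000.a.akamaiedge.net"),
    "e0000.a.akamaiedge.net": ("A", "192.0.2.10"),
}


class ResolutionError(Exception):
    """Raised when a name in the chain cannot be resolved."""


def zone_is_down(name, unavailable_zones):
    # A name is affected if it is the zone apex or sits anywhere below it.
    return any(name == z or name.endswith("." + z) for z in unavailable_zones)


def resolve(name, records=RECORDS, unavailable_zones=(), max_hops=8):
    """Follow CNAMEs until an A record is found; return (chain, ip)."""
    chain = [name]
    for _ in range(max_hops):
        if zone_is_down(name, unavailable_zones):
            raise ResolutionError(f"no answer from nameservers for {name}")
        if name not in records:
            raise ResolutionError(f"no record for {name}")
        rtype, value = records[name]
        if rtype == "A":
            return chain, value
        # CNAME: restart the lookup at the target name.
        name = value
        chain.append(name)
    raise ResolutionError("CNAME chain too long")
```

Calling `resolve("www.example.com")` walks all three names and returns the edge address. Passing `unavailable_zones=("akamaiedge.net",)` raises `ResolutionError` before any address is returned, even though the edge server behind that address is still reachable.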

Impact of Outage Varied Widely

Akamai is one of the top global CDN providers, with a significant customer base ranging from large banks and SaaS providers to major ecommerce sites such as Amazon. During the incident, ThousandEyes observed significant variation in impact across sites using Akamai's services, with some organizations maintaining greater availability than others.

For sites relying solely on Akamai’s DNS and CDN services, some end users saw the site as unavailable throughout the DNS outage.

Figure 4. Akamai customer site inaccessible for the duration of the incident due to DNS resolution timeouts

Those users saw DNS resolution errors in their browsers, and further troubleshooting (such as with ThousandEyes DNS tests) showed DNS SERVFAIL errors or no responses when attempting to reach authoritative nameservers in the Akamai DNS service. 

Figure 5. Akamai DNS authoritative servers unable to provide resolution

Not every Akamai customer was similarly impacted. Amazon’s ecommerce site saw almost no impact during the incident. Amazon differs from the customer shown above in that it uses multiple CDN providers to host its site’s content and leverages its own DNS service to balance traffic across those providers. This architecture has several advantages. CDN providers may have different geographic coverage or be optimized to deliver certain types of content. A multi-CDN approach also increases site resiliency, so that no individual CDN provider is a single point of failure. Amazon was able to distribute traffic to its providers throughout the outage in ways that appear to have spared its users from impact.

Figure 6. Amazon's multi CDN approach reduced the impact for their users.
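The idea of a customer-controlled DNS layer steering traffic across several CDN providers can be sketched as a weighted choice among providers that currently pass a health check. This is a minimal sketch, not a description of how Amazon actually steers traffic: the provider names, the weights, and the `pick_cdn` function are all hypothetical.

```python
import random


def pick_cdn(weights, healthy, rng=random):
    """Pick a CDN provider by weight, considering only healthy providers.

    weights: dict mapping provider name -> relative traffic share (hypothetical).
    healthy: set of provider names that currently pass health checks.
    """
    candidates = {cdn: w for cdn, w in weights.items() if cdn in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy CDN provider available")
    names = list(candidates)
    # random.choices performs a weighted draw; k=1 returns a one-element list.
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


# Hypothetical steering policy: 70% of lookups to cdn-a, 30% to cdn-b.
WEIGHTS = {"cdn-a": 70, "cdn-b": 30}
```

With both providers healthy, answers are split roughly 70/30. If `cdn-a` drops out of the `healthy` set, every call returns `cdn-b`, which is the behavior that keeps a single provider from becoming a single point of failure.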

Amazon did not completely stop using Akamai for content delivery during the outage. Akamai continued to serve some locations and some content, as seen in Figure 7, without degrading user experience.

Figure 7. Content continued to be served from Akamai, without degrading users' experience.

Another ecommerce provider was mostly available throughout the outage, but visitors to the site may have noticed longer page load times due to long waits for DNS responses, as seen in Figure 8 below.

Figure 8. An ecommerce provider was largely accessible during the incident but experienced DNS performance degradation that impacted server response and page load times.

This ecommerce provider continued to use Akamai for its site’s root object, but leveraged other CDN providers for significant portions of its page elements. 

Outage Takeaways

The Akamai DNS outage is yet another reminder that outage outcomes are not solely the responsibility of external providers. Outages are inevitable, regardless of which provider or service is used. However, organizations can and should take measures to reduce risks to their digital business by considering redundancy for critical services and having plans in place to address inevitable, unplanned disruptions.

Here are the three top lessons to take away from this incident:

  • Consider leveraging redundant providers for key services, such as CDN and DNS. Multiple CDN providers can increase service resilience, as well as improve performance for users. Akamai customers using multiple CDN providers were least impacted by this outage.
  • Have backup plans in place for when things inevitably go wrong. Even if you’ve implemented best-practice, redundant service architectures, expect unforeseen failures. Put contingency playbooks in place to address failure scenarios, in order to minimize downtime or performance degradation of your services.
  • Ensure proactive visibility into your sites, apps, and key dependencies so that you quickly know when to implement backup plans. Knowing when to trigger a backup plan is critical to its success. Visibility into all application components, including any third-party dependencies that make up your service delivery chain, is the most efficient way to identify when to act and which strategy to execute to mitigate service issues.
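As a rough illustration of the last two lessons, a monitor can probe DNS resolution on a schedule and signal that a contingency playbook should run only after several consecutive failures, so that a single slow lookup does not trigger a failover. The sketch below uses only the Python standard library; the `FailoverTrigger` class, the `dns_resolves` helper, and the threshold of three are assumptions chosen for the example, not a ThousandEyes feature.

```python
import socket


def dns_resolves(hostname):
    """Return True if the local resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised for resolution failures, such as an unresolvable name.
        return False


class FailoverTrigger:
    """Signal a failover after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_ok):
        """Record one probe result; return True when the playbook should run."""
        if check_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

In use, a scheduler would call `trigger.record(dns_resolves("www.example.com"))` at a fixed interval and start the backup plan when it returns `True`. Probing from several vantage points and resolvers would give a more reliable signal than a single local check.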

To learn more about DNS and how it works, be sure to download the Internet Fundamentals: Underlying Network Infrastructures Explained. You can also sign up for a ThousandEyes free trial to start gaining deeper insight into your service dependencies and their performance. Finally, to stay up-to-date on the latest Internet outage intelligence, be sure to subscribe to our podcast, The Internet Report.
