On July 22, at approximately 8:38AM PT, Akamai’s DNS—a critical service which directs users to its CDN edge—suffered an outage that prevented users around the globe from reaching its customers’ sites. Users attempting to browse sites hosted by Akamai received error messages indicating that the requested domain name could not be resolved to an IP address. The issue was resolved and service restored approximately one hour later at 9:45AM PT. You can read Akamai’s summary of the service disruption here.
Resolving domain names to IP addresses via the Domain Name System is a critical first step in reaching a web property. Although brief, the scope of the outage was significant, with a large number of websites and applications (ranging from gaming to major banks, airlines and more) completely or intermittently unreachable. While network connectivity to Akamai’s CDN edge infrastructure was available during the outage, without Edge DNS's authoritative nameservers to resolve domain names, the websites and applications became unreachable to users.
ThousandEyes observed a spike in web and application outages during the incident—all hosted on Akamai’s servers.
ThousandEyes further observed that Akamai’s DNS service was unable to resolve the domains hosted in Akamai's CDN.
Role of DNS in CDN Traffic Management
The Domain Name System (DNS) maps human-readable domain names, such as “example.com,” to IP addresses. CDN providers commonly use DNS to load balance traffic across their infrastructure and direct users to the optimal edge server based on geographical proximity, server availability or performance, and other factors. To use the DNS in this way, the service provider must host DNS records for the sites in its CDN. Enterprises will typically configure a domain name such as “www” as a CNAME record, which may point to additional CNAMEs and ultimately ends with an A record that provides the IP address. This layered approach allows the service provider to control which IP address in the CDN a client receives, so as to optimize the client experience, and provides flexibility to change the IP addresses when needed.
For example, ThousandEyes uses Akamai’s CDN service to host our website “www.thousandeyes.com.” This domain name resolves to a CNAME in Akamai DNS's "edgekey.net" zone. That CNAME in turn resolves to another Akamai CNAME in the "akamaiedge.net" zone, and that name then resolves to an A record with the IP address needed to reach the CDN edge (see figure 3 below).
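The chain described above can be sketched as a small in-memory lookup. This is an illustrative model only, not how Akamai's nameservers or any real resolver is implemented: the record names under edgekey.net and akamaiedge.net, the IP address (taken from the 192.0.2.0/24 documentation range), and the `resolve` helper are all hypothetical placeholders.

```python
# Hypothetical record table: names and address are placeholders,
# not real Akamai records.
RECORDS = {
    "www.thousandeyes.com": ("CNAME", "www.thousandeyes.com.edgekey.net"),
    "www.thousandeyes.com.edgekey.net": ("CNAME", "e0000.example.akamaiedge.net"),
    "e0000.example.akamaiedge.net": ("A", "192.0.2.10"),
}


def resolve(name, records, unavailable_zones=(), max_hops=8):
    """Follow CNAMEs until an A record is found; return (chain, ip).

    Raises LookupError if a name in the chain has no record or belongs
    to a zone whose authoritative servers are marked unavailable.
    """
    chain = [name]
    for _ in range(max_hops):
        if any(name.endswith("." + zone) for zone in unavailable_zones):
            raise LookupError(f"no answer from authoritative servers for {name}")
        if name not in records:
            raise LookupError(f"no record for {name}")
        rtype, value = records[name]
        if rtype == "A":
            return chain, value
        # CNAME: continue the lookup with the target name.
        name = value
        chain.append(name)
    raise LookupError("CNAME chain too long")
```

Calling `resolve("www.thousandeyes.com", RECORDS)` walks all three names and returns the placeholder address. Passing `unavailable_zones=("edgekey.net",)` raises `LookupError` at the second hop, which mirrors why a reachable CDN edge is of no use when the zone that leads to it cannot be resolved.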
If the CDN’s DNS service were to become unavailable, then the CDN edge would effectively be unreachable as well. This is what happened during the Akamai DNS service disruption, although the impact of the outage varied across its customers and users based on multiple factors.
Impact of Outage Varied Widely
Akamai is one of the top global CDN providers, with a significant customer base ranging from large banks and SaaS providers to major ecommerce sites, such as Amazon. During the incident, ThousandEyes observed significant variation in impact across sites using Akamai’s services, with some organizations maintaining greater availability than others.
For sites solely using Akamai’s DNS and CDN service, some end-users would see an Akamai-hosted site as unavailable throughout the DNS outage.
Those users saw DNS resolution errors in their browsers, and further troubleshooting (such as with ThousandEyes DNS tests) showed DNS SERVFAIL errors or no responses when attempting to reach authoritative nameservers in the Akamai DNS service.
Not every Akamai customer was similarly impacted. Amazon’s ecommerce site saw nearly no impact during the incident. Amazon differs from the customer shown above in that it uses multiple CDN providers to host its site’s content and leverages its own DNS service to balance traffic across those providers. This architecture has several advantages. CDN providers may have different geographic coverage or be optimized to deliver certain types of content. A multi-CDN approach also increases site resiliency, so that no individual CDN provider becomes a single point of failure. Amazon was able to distribute traffic to its providers throughout the outage in ways that appear to have spared its users from impact.
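One way to picture this kind of DNS-based multi-CDN steering is a weighted choice among the providers that are currently healthy. The sketch below is an illustration under stated assumptions, not Amazon's actual method: the provider names, the weights, and the `pick_cdn` function are hypothetical.

```python
import random

# Hypothetical providers and traffic weights; not from any real deployment.
CDN_WEIGHTS = {"cdn-a": 50, "cdn-b": 30, "cdn-c": 20}


def pick_cdn(weights, healthy, rng=random):
    """Choose a CDN for one DNS answer, weighted among healthy providers only."""
    candidates = {
        name: weight
        for name, weight in weights.items()
        if name in healthy and weight > 0
    }
    if not candidates:
        raise RuntimeError("no healthy CDN provider available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
```

If "cdn-a" is removed from the `healthy` set, every answer comes from the remaining providers in proportion to their weights, so a single provider's outage reduces capacity rather than availability. This only helps if the DNS service doing the steering is itself independent of the failing provider.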
Amazon did not completely stop using Akamai for content delivery during the outage. Akamai continued to serve some locations and content, as seen in figure 7, without degrading user experience.
Another ecommerce provider was mostly available throughout the outage, but users of the site may have noticed longer page load times due to long waits for DNS responses, as seen in figure 8 below.
This ecommerce provider continued to use Akamai for its site’s root object, but leveraged other CDN providers for significant portions of its page elements.
The Akamai DNS outage is yet another reminder that outage outcomes are not solely the responsibility of external providers. Outages are inevitable, regardless of which provider or service is used. However, organizations can and should take measures to reduce risks to their digital business by considering redundancy for critical services and having plans in place to address inevitable, unplanned disruptions.
Here are the three top lessons to take away from this incident:
- Consider leveraging redundant providers for key services, such as CDN and DNS. Multiple CDN providers can increase service resilience, as well as improve performance for users. Akamai customers using multiple CDN providers were least impacted by this outage.
- Have backup plans in place for when things inevitably go wrong. Even if you’ve implemented best-practice, redundant service architectures, expect unforeseen failures. Put contingency playbooks in place to address failure scenarios, in order to minimize downtime or performance degradation of your services.
- Ensure proactive visibility into your sites, apps, and key dependencies to quickly know when to implement backup plans. Knowing when to trigger a backup plan will be critical to its success. Visibility into all application components, including any third-party dependencies that make up your service delivery chain, will provide the most efficient approach to identifying when to act and which strategy to execute to mitigate service issues.
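As a rough illustration of the third lesson, a monitor might track recent DNS probe results and signal when the failure rate crosses a threshold. The window size, the threshold, and the `FailoverTrigger` class are hypothetical choices for this sketch; the probe results themselves would come from whatever monitoring is already in place.

```python
from collections import deque


class FailoverTrigger:
    """Signal when the recent DNS probe failure rate reaches a threshold."""

    def __init__(self, window=10, threshold=0.5):
        # Keep only the most recent `window` results; True = probe succeeded.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success):
        """Record one probe result; return True if the backup plan should start."""
        self.results.append(bool(success))
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples yet to judge
        failures = self.results.count(False)
        return failures / len(self.results) >= self.threshold
```

Requiring a full window before signaling avoids triggering a failover on a single failed probe, at the cost of a short detection delay. The right trade-off depends on how disruptive the backup plan itself is.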
To learn more about DNS and how it works, be sure to download the Internet Fundamentals: Underlying Network Infrastructures Explained. You can also sign up for a ThousandEyes free trial to start gaining deeper insight into your service dependencies and their performance. Finally, to stay up-to-date on the latest Internet outage intelligence, be sure to subscribe to our podcast, The Internet Report.