On the morning of December 27th, ThousandEyes monitors picked up a major service outage on the CenturyLink network impacting AS 209. A quick scan of social media revealed that users all across the US were reporting issues not just with Dedicated Internet Access (DIA) services but also with critical services like E911 and traditional voice communications. The outage took over 50 hours to fully resolve and impacted thousands of businesses and individuals across the US.
So what really happened? The official Reason For Outage (RFO) analysis from CenturyLink points to issues with a bad network management card in Denver, CO.
How does a single network management card cause a massive network-wide failure? While we don’t know the exact details, the symptoms and the explanation most closely align with those of an optical network outage, which has the potential to disrupt E911 services along with the Internet. Optical switches typically rely on an out-of-band management network for provisioning and remote management. This network was impacted during the outage, which further hampered CenturyLink’s ability to remotely troubleshoot and recover from the failures.
Five Key Takeaways
Enterprises today are increasingly consuming Software-as-a-Service (SaaS) applications, which are primarily delivered over the Internet, and are actively evaluating or migrating to hybrid and SD-WAN architectures. The Internet is therefore a mission-critical dependency for any enterprise, and there are valuable lessons to be learned from this outage.
1. Visibility, Visibility, Visibility — You can never have too much visibility into the networks and services your employees and customers rely upon. The Internet is a mission-critical network, and you need to know how it’s performing so that you can effectively mitigate outages like this one. CenturyLink lost visibility into its optical switches, which severely impaired its ability to diagnose and recover from the failure. However, CenturyLink customers that had service resiliency and sufficient visibility were able to detect and route around the failure and preserve their services. Figure 3 below shows a BGP visualization from ThousandEyes where an enterprise has routed away from CenturyLink to another ISP.
2. Resiliency — Is your redundant service really redundant? We saw numerous reports on Reddit of data centers whose dual ISP connections were both down during the outage. If you purchased redundant service from the same vendor, there is no guarantee that the two circuits will not end up in the same fiber cable, or even on the same fiber, at some point along the path. Furthermore, you are exposed to system-wide outages that affect a particular vendor, as we saw in this case. What you really need is vendor redundancy.
3. Can SD-WAN save you? — The short answer is yes, provided you’re using truly redundant Active/Active Internet connections. SD-WAN technology is good at detecting hard-down outages such as this one. We can think of these as first-order failures, where the fiber goes dark or the next-hop IP address is unreachable, and it’s very clear that the router needs to fail over to the backup circuit. The decision is less clear when the primary Internet circuit is impaired or degraded. The problem may be a few hops away, or even a few ISPs away. It may only impact performance for one application, for example SharePoint Online, while all other apps work just fine. That’s a much harder decision for the SD-WAN router, and that’s where there is no substitute for deep Internet visibility and human judgment.
4. Manageability — Can your network elements reach their controllers? Lately we have seen a shift towards centralized or cloud-based orchestration and management of network elements, SD-WAN routers, and WiFi access points. This is immensely valuable to enterprises because centralized orchestration enables better network automation. However, what happens when the network fabric connecting these elements to their controllers is impaired? A 50-hour outage could ensue, as happened to CenturyLink when its centralized management and control plane was disabled.
Cloud-based management provides tremendous value, but when planning your hybrid and SD-WAN enterprise network strategy, ensure that you have adequate visibility into the network paths both inside and outside your enterprise perimeter so that you can recover from issues that strike the fabric your centralized orchestration depends on.
5. Contingency planning — A 50-hour outage is akin to a 100-year flood, yet both are happening with increasing regularity. Make sure you have a plan in place for when your data center is down to one working Internet circuit for several days. Consider tertiary providers who can turn up circuits on short notice. Make sure you have enough capacity to handle peak loads for extended periods of time on a single circuit. And audit your suppliers regularly so that you have the most reliable service your budget allows.
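To make takeaways 2, 3 and 5 concrete, here is a minimal Python sketch of the checks they describe: telling a hard-down circuit from a degraded one, confirming that circuits come from more than one vendor, and confirming that the remaining circuits can carry peak load if one vendor fails entirely. It is an illustration under stated assumptions, not how any SD-WAN product actually decides. The Circuit fields, the function names, and the loss and latency thresholds are all hypothetical and would need tuning per application.

```python
from dataclasses import dataclass
from enum import Enum


class CircuitState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    HARD_DOWN = "hard_down"


@dataclass(frozen=True)
class Circuit:
    """One Internet circuit, with hypothetical measurement fields."""
    name: str
    provider: str             # vendor selling the circuit
    capacity_mbps: float
    next_hop_reachable: bool  # first-order health signal
    loss_pct: float           # end-to-end packet loss to a key application
    latency_ms: float         # end-to-end latency to the same application


# Assumed thresholds for illustration only.
MAX_LOSS_PCT = 2.0
MAX_LATENCY_MS = 150.0


def classify(circuit: Circuit) -> CircuitState:
    """Separate the easy case (hard down) from the hard one (degraded)."""
    if not circuit.next_hop_reachable:
        # Fiber dark or next hop unreachable: failover is the obvious call.
        return CircuitState.HARD_DOWN
    if circuit.loss_pct > MAX_LOSS_PCT or circuit.latency_ms > MAX_LATENCY_MS:
        # The link is up but the application path is impaired, possibly
        # several hops or ISPs away. Flag it for deeper investigation.
        return CircuitState.DEGRADED
    return CircuitState.HEALTHY


def has_vendor_redundancy(circuits: list[Circuit]) -> bool:
    """True only if the circuits come from at least two different vendors."""
    return len({c.provider for c in circuits}) >= 2


def survives_loss_of_provider(
    circuits: list[Circuit], provider: str, peak_load_mbps: float
) -> bool:
    """Can the remaining circuits carry peak load if one vendor fails entirely?"""
    remaining = sum(c.capacity_mbps for c in circuits if c.provider != provider)
    return remaining >= peak_load_mbps


if __name__ == "__main__":
    site = [
        Circuit("primary", "ISP-A", 1000.0, True, 0.1, 20.0),
        Circuit("backup", "ISP-A", 1000.0, True, 0.1, 22.0),
    ]
    # Two circuits from one vendor: redundant on paper, not in practice.
    print(has_vendor_redundancy(site))
    print(survives_loss_of_provider(site, "ISP-A", peak_load_mbps=600.0))
```

The DEGRADED state is deliberately not mapped to an automatic action here. As takeaway 3 argues, that case needs visibility into the wider Internet path and human judgment rather than a simple threshold.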
Get Ahead of Internet Risks Now
For better or worse, your business is likely becoming dramatically more dependent on the Internet. There are some very simple lessons about redundancy to learn from the CenturyLink outage, but just mastering the basics of redundancy doesn’t mean you’re achieving optimal IT and network operations in the cloud era. To learn more, check out the plethora of educational resources on our website and blog. If you’d like to better understand these dynamics in the context of a specific challenge, contact us and we’ll be happy to review some best-practice recommendations and discuss how to thrive in the face of widespread Internet outages.