CenturyLink/Level 3 is a major global ISP that peers with many app providers and enterprises, including Google, Cloudflare and OpenTable. That made the blast radius of this outage extremely wide, as the provider effectively terminated a large portion of Internet traffic around the world. While the service impact of the outage was total packet loss across CenturyLink’s geographically distributed infrastructure, the outage was reportedly caused by a crippled control plane brought on by the internal propagation of a faulty BGP announcement.
Outage Root Cause: The Role of BGP and Flowspec
CenturyLink acknowledged the outage in a status update posted to Twitter at 12:33 UTC (9:33AM EST), nearly three and a half hours after the outage began, and followed up with a preliminary root cause after recovery.
An “offending flowspec announcement” prevented the establishment of Border Gateway Protocol sessions across elements in its network, a catastrophic scenario given the role BGP plays in internal traffic routing, as well as in Internet routing (peering and traffic exchange between autonomous networks). Flowspec is a BGP extension (essentially a feature addition to the BGP-4 specification) that is used to easily distribute firewall-like rules via BGP updates. It’s considered a powerful tool for quickly pushing filter rules to a large number of routers and, historically, it’s been used for mitigating DDoS attacks. It functions similarly to access control lists (ACLs), but unlike ACLs, which are static, flowspec, as part of BGP, is dynamic. The dynamic nature of flowspec means that, while powerful, it can cause significant issues if improperly configured.
In an expanded analysis provided to its customers, CenturyLink indicated that the improperly configured flowspec was part of a botched effort to block unwanted traffic on behalf of a customer, a routine internal use case for flowspec. In this instance, the customer requested that CenturyLink block traffic from a specific IP address. The flowspec for this request was accidentally implemented with wildcards, rather than isolated to a specific IP address. This misconfiguration, along with the failure of filter mechanisms and other factors, ultimately led to the incident. The timing of the outage (~4AM MDT relative to Level 3 headquarters in Denver, Colorado) suggests that this announcement could have been introduced as part of routine network updates, which typically take place during early morning hours to avoid broad user impact.
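To make the difference between a scoped rule and a wildcarded one concrete, here is a minimal sketch. It is not CenturyLink's configuration or a real flowspec encoding: the `FlowRule` class, its fields, and the addresses are hypothetical, and it assumes the wildcarded rule ended up matching the routers' own BGP traffic (TCP port 179).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class FlowRule:
    """Toy stand-in for a flowspec filter: match fields plus an implied drop action."""
    source: str = "0.0.0.0/0"        # source prefix; the default matches any IPv4 source
    dest_port: Optional[int] = None  # None acts as a wildcard for the port

    def matches(self, src_ip: str, dst_port: int) -> bool:
        in_prefix = ip_address(src_ip) in ip_network(self.source)
        port_ok = self.dest_port is None or self.dest_port == dst_port
        return in_prefix and port_ok


BGP_PORT = 179  # BGP sessions run over TCP port 179

# Intended rule: drop traffic from one unwanted host (placeholder address).
intended = FlowRule(source="203.0.113.7/32")
# Wildcarded rule: no source or port constraint, so it matches everything.
wildcarded = FlowRule()

# A BGP packet from a hypothetical internal router.
router_ip = "192.0.2.1"
print(intended.matches(router_ip, BGP_PORT))    # False
print(wildcarded.matches(router_ip, BGP_PORT))  # True
```

The scoped rule leaves router-to-router BGP traffic alone, while the wildcarded one would drop it along with everything else.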
When a router received the offending flowspec rule (likely as part of a broader rule list) and executed it on reaching that rule in the list, its BGP sessions would be dropped, severing the control plane connection, including communication of the offending flowspec rule itself. Since BGP is dynamic and the flowspec rule would not persist past termination of BGP, the routers would then attempt to reestablish BGP, at which point they would receive BGP announcements, including the offending flowspec rule. The routers would again work down the rule list, implementing each rule in turn, until hitting, yes, the dreaded flowspec rule, and the control plane connections would terminate once more. If this sounds like a nightmarish infinite loop, you’re not wrong.
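The cycle described above can be sketched as a small simulation. This is an illustration of the reported behavior under simplifying assumptions, not a model of any real router: the rule names and the `run_session_cycles` function are hypothetical.

```python
from typing import List


def run_session_cycles(rules: List[str], bad_rule: str, max_cycles: int) -> List[int]:
    """Simulate a router that re-learns its rule list after every BGP reset.

    Returns, per cycle, how many rules were applied before the session dropped
    (or the full list length if the session survived).
    """
    applied_per_cycle = []
    for _ in range(max_cycles):
        applied = 0
        session_up = True            # session re-established, rules re-received
        for rule in rules:
            applied += 1
            if rule == bad_rule:
                session_up = False   # the rule blocks BGP, tearing the session down
                break
        applied_per_cycle.append(applied)
        if session_up:
            break                    # no offending rule: the router stays stable
        # Rules learned over BGP are discarded with the session, so the
        # next cycle starts from scratch and hits the same rule again.
    return applied_per_cycle


rules = ["rule-a", "rule-b", "block-everything", "rule-c"]
print(run_session_cycles(rules, "block-everything", max_cycles=3))  # [3, 3, 3]
print(run_session_cycles(["rule-a", "rule-b"], "block-everything", max_cycles=3))  # [2]
```

Every cycle ends at the same rule, so nothing in the loop itself can break the pattern; the offending rule has to stop being delivered.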
The cause of the outage would account for its extended nature — a nearly unprecedented 5 hours — as efforts to remediate the outage could have been challenging given the looping state of the network and the rejection of BGP as described above.
In Cloudflare's analysis of the incident (which impacted reachability of its services), it noted that during the outage, there was a substantial increase in BGP announcements over typical levels (based on RouteViews data). The increase in announcements aligns with CenturyLink’s statement and with the announcement behavior ThousandEyes observed. As CenturyLink routers attempted to establish BGP after session termination, they would have resent route announcements to their peers. Given that CenturyLink is a major Tier 1 ISP and is densely peered with other providers (who are themselves densely peered), the cascading impact of the re-announcements and their subsequent propagation across the Internet would have been enormous.
How the Outage Unfolded
ThousandEyes observed the massive scale of the outage through an aggregate view collected across our global sensor network. At the start of the outage, a few minutes after 10AM UTC, ThousandEyes detected traffic terminating on a large number of interfaces on Level 3 (CenturyLink) infrastructure, as well as in other ISP networks on nodes directly connected to Level 3. Figure 3 below shows Level 3 as the locus of the incident, with 72 interfaces affected just as the outage unfolds. At the peak of the outage, nearly 522 interfaces were impacted, both within Level 3 and in other ISPs on their peering connections with Level 3.
Level 3’s network is heavily concentrated in the United States, although they also have a global presence. The geographic scope of the outage, as reflected in figure 4, shows the simultaneous effect of the incident on its infrastructure, regardless of location.
Outages caused by control plane issues, as is the case with this incident, are often massive in scale. Previous large-scale outages, such as Google's partial network outage last year, or Comcast’s nationwide outage two years ago, were caused by damage to (or downing of) the network’s control plane.
Internal control plane issues led to traffic getting dropped once it reached CenturyLink’s network; however, the symptoms of its aberrant network state were also evident from an external BGP standpoint, as ThousandEyes witnessed throughout the five-hour incident. An example of Level 3’s unstable BGP can be seen below, in a BGP visualization of ThousandEyes’ routing paths which, before the incident, were announced through its two service providers, Level 3 and Zayo Bandwidth.
At approximately 10AM UTC, as the outage began, Level 3 started route flapping (issuing BGP changes in rapid succession, such that traffic routing becomes unstable). This behavior is likely the external manifestation of the flowspec rule described above, where BGP sessions are torn down and quickly reestablished once BGP is terminated. This same flapping behavior repeated at regular intervals throughout the incident, though the scope of the route changes diminished as more providers de-peered, or at least no longer propagated routes from Level 3.
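As a rough illustration of what "route flapping" looks like in update data, the sketch below flags prefixes that receive many BGP updates within a short window. It is not how ThousandEyes detects flapping; the `flapping_prefixes` function, the update tuple layout, the thresholds, and the sample prefixes are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, prefix, "announce" or "withdraw")
Update = Tuple[float, str, str]


def flapping_prefixes(updates: Iterable[Update], window: float, threshold: int) -> List[str]:
    """Return prefixes with at least `threshold` updates inside any `window` seconds."""
    times: Dict[str, List[float]] = defaultdict(list)
    for ts, prefix, _kind in updates:
        times[prefix].append(ts)

    flapping = []
    for prefix, stamps in times.items():
        stamps.sort()
        start = 0
        for end in range(len(stamps)):
            # Slide the left edge forward until the span fits inside the window.
            while stamps[end] - stamps[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flapping.append(prefix)
                break
    return sorted(flapping)


updates = [
    (0, "198.51.100.0/24", "announce"),
    (20, "198.51.100.0/24", "withdraw"),
    (40, "198.51.100.0/24", "announce"),
    (55, "198.51.100.0/24", "withdraw"),
    (0, "203.0.113.0/24", "announce"),
    (900, "203.0.113.0/24", "withdraw"),
]
print(flapping_prefixes(updates, window=60, threshold=4))  # ['198.51.100.0/24']
```

The first prefix changes state four times in under a minute and is flagged; the second changes twice in fifteen minutes and is not.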
Traffic termination is certainly problematic, but what made this outage so disruptive to Level 3’s enterprise customers and peers is that efforts to revoke announcements to Level 3 (a common method to reroute around outages and restore service reachability) were not effective. Level 3 was not able to honor any BGP changes from peers during the incident, most likely due to an overwhelmed control plane. Revoking the announcement of prefixes from Level 3, preventing route propagation through a no-export community string, and even shutting down an interface connection to the provider would have been fruitless.
To bypass Level 3 and prevent traffic from getting routed through them, ThousandEyes withdrew all announcements from Level 3 and blocked incoming announcements from the provider. Together, these steps were meant to prevent any ingress or egress traffic from hitting Level 3’s network. Simultaneously, ThousandEyes began announcing its /24 prefix to Cogent, as a replacement provider to Level 3. The establishment of Cogent in the path (and Level 3 route instability) can be seen in figure 5 below. ThousandEyes took the additional step of shutting down the interface connected to Level 3, effectively de-peering with the provider.
Despite these remediation efforts, CenturyLink continued to announce, to its peers, routes to ThousandEyes (and its other customers) as they were before the incident — effectively announcing stale, no longer legitimate routes.
As three service providers were then announcing routes to the ThousandEyes service (though only two legitimately), the number of paths announced increased over pre-outage levels.
As the incident continued over several hours, ISPs, such as NTT, began de-peering with Level 3. The below animation shows a timeline of peering changes involving NTT.
NTT begins peering with Cogent instead of Level 3 at 13:00 UTC (9AM ET) — approximately three hours after the start of the outage. Level 3 announcements do not reflect NTT’s revocation of routes until the incident is resolved and CenturyLink regains control of its network, at which point, the BGP path reflects NTT’s peering change from Level 3 to Cogent.
Another major ISP, Telia, indicated that at approximately 14:00 UTC CenturyLink requested the provider no longer peer with its network in order to reduce the number of route announcements it received (a move possibly meant to stabilize its network). Once the incident was resolved, CenturyLink requested Telia re-establish peering. Route flapping can exact a significant toll on routing infrastructure, and de-peering from CenturyLink would have reduced the number of BGP updates it received from its peers, allowing CenturyLink’s control plane to stabilize and break the looping pattern described earlier.
While the above examples focus on some immediate peers of Level 3, it's important to keep in mind that the Internet is a web of interdependencies and even if you or your immediate peers were not directly connected to Level 3, you and your users could have been impacted. Internet routing is highly dynamic and influenced by a number of factors that include path length, route specificity, commercial agreements, and provider-specific peering preferences. Many users, services and ISPs impacted as a result of this outage were not customers or direct peers of CenturyLink/Level 3, yet found that their traffic was routed through that provider at some point.
How the Outage Impacted Services
Traffic routing in the wilds of the Internet is highly complex, and influenced (not controlled!) by a number of factors. Keeping the broad impact in mind (beyond customers and peers), we’re next going to examine two enterprises leveraging Level 3 as their service provider, how they were differently impacted during the course of the outage, and learnings based on our analysis.
The first service, OpenTable, was significantly impacted for the duration of the five-hour incident, with packet loss remaining high throughout.
Figure 8 below shows traffic inbound from multiple geographic vantage points around the globe getting dropped before reaching its service.
Another service, GoToMeeting, fared far better during the incident, partly due to its remediation efforts, but also simply due to peering luck.
GoToMeeting was only actively announcing routes through a single service provider — Level 3. However, shortly after the outage began, it stopped announcing routes through Level 3 and brought its backup provider online — GTT. While Level 3 continued to announce stale routes to GoToMeeting, most traffic began to get routed through GTT. The prefixes announced by GoToMeeting were identical to those that had been announced through Level 3 (i.e. they were not more specific routes); however, routes announced through GTT seemed to be preferred to those of Level 3. The reason why GTT routes were preferred to Level 3 (mitigating the impact of the outage on GoToMeeting) likely came down to GTT’s peering density and peering relations that may have made GTT routes more attractive (for cost or other reasons).
Could Enterprises Have Mitigated the Impact?
This outage was extremely unusual, not only in terms of scope, but in how customer remediation efforts were thwarted; however, there were some concrete steps that enterprises could have taken to reduce service impact. In the case of traffic ingressing an enterprise’s network, routing paths can be influenced but not completely controlled. The exit path from an enterprise network, however, can be controlled.
During the course of the incident, some traffic routed through service providers other than Level 3 was reaching services, but getting dropped by Level 3 on the reverse path. Keeping in mind asymmetric routing, if enterprises had not only revoked advertisements to Level 3 (which were ignored by the provider), but also stopped accepting route announcements from Level 3 and shut down peering, they could have reduced the impact on their traffic. Changing local preferences would have been another (though not quite foolproof) way to send traffic to the provider that is operating normally and not the provider dropping traffic.
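The effect of local preference on the exit path can be sketched with a deliberately simplified best-path comparison. Real BGP path selection has more steps than this; the sketch only assumes the two commonly cited first criteria (highest local preference, then shortest AS path). The `Route` class, provider labels, and AS numbers (taken from the documentation range) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    provider: str
    local_pref: int            # higher wins
    as_path: Tuple[int, ...]   # shorter wins when local_pref ties


def best_route(candidates: List[Route]) -> Route:
    """Pick the preferred route: highest local preference, then shortest AS path."""
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))


# Equal local preference: the shorter AS path through the failing provider wins.
before = [
    Route("0.0.0.0/0", "failing-provider", 100, (64500,)),
    Route("0.0.0.0/0", "healthy-provider", 100, (64501, 64502)),
]
print(best_route(before).provider)  # failing-provider

# Raising local preference on the healthy provider overrides path length.
after = [
    Route("0.0.0.0/0", "failing-provider", 100, (64500,)),
    Route("0.0.0.0/0", "healthy-provider", 200, (64501, 64502)),
]
print(best_route(after).provider)  # healthy-provider
```

Because local preference is evaluated inside the enterprise's own network, it steers outbound traffic only; it does nothing about how others reach the enterprise.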
From an inbound path standpoint, if Level 3’s announcement behavior during this incident has a parallel, it is to that seen with BGP hijacking, where illegitimate routes are being announced by an AS. To counter the illegitimate announcements, an enterprise could have begun announcing more specific routes to its service. For example, if it was announcing a /22, it could have started announcing four /24s through the unaffected provider, which would have made it the more preferential path, effectively steering traffic away from the outage. Not every enterprise may have this option, however. In that instance, splitting a /24 into two /25s is possible, although some research has indicated that routing irregularities may occur with this prefix length, as some providers do not accept announcements more specific than /24. Enterprises using anycast may also have to be careful about changing route specificity, as that could create a traffic distribution imbalance that could subsequently impact their services.
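The /22-to-/24 example works because routers forward on the longest matching prefix. A minimal sketch using Python's standard `ipaddress` module follows; the `10.0.0.0/22` block is a placeholder (private space would not be announced on the Internet), and the `lookup` helper and route labels are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Placeholder aggregate standing in for an enterprise's announced /22.
aggregate = ip_network("10.0.0.0/22")

# Split the /22 into its four /24s.
more_specifics = list(aggregate.subnets(new_prefix=24))
print([str(net) for net in more_specifics])
# ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24', '10.0.3.0/24']

# Toy routing table: the stale aggregate plus the new, more specific routes.
table = {aggregate: "stale route via failing provider"}
table.update({net: "route via unaffected provider" for net in more_specifics})


def lookup(routes, address):
    """Longest-prefix match: return the most specific covering route, or None."""
    addr = ip_address(address)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return str(best), routes[best]


print(lookup(table, "10.0.2.25"))
# ('10.0.2.0/24', 'route via unaffected provider')
```

Even though the stale /22 is still in the table, every address inside it is covered by a /24, so the more specific route wins.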
Lessons and Takeaways
The dynamic, uncontrolled (and contextual) nature of Internet routing was on full display during this incident, underscoring the significant impact of peering and provider choices — not only your own, but those of your peers and their peers. The deeply interconnected and interdependent nature of the Internet means that no enterprise is an island — every enterprise is a part of the greater Internet whole, and subject to its collective issues. Understanding your risk factors requires an understanding of who is in your wider circle of dependencies and how their performance and availability could impact your business if something were to go wrong.
Maintaining visibility into the routing, availability, and performance of your critical providers is also extremely important, as external communication on status and root cause can vary widely by provider and is often slow to arrive. When it does, it may be past its usefulness in addressing an issue proactively.
Finally, consider the context of the outage (and any outage). In the case of this CenturyLink/Level 3 incident, the timing of it dramatically reduced its impact on many businesses, as it occurred in the early hours (at least in the U.S.) on a Sunday morning. Perhaps, that’s one bit of good news we can take away from this incident. And we all could use some good news about now.
Want to learn more? Archana and I also discussed the CenturyLink outage on this week's episode of #TheInternetReport, our weekly show covering what's working and what's breaking on the Internet -- and why.