On Monday morning, November 6th, 2017, from 9:45am to 11:25am Pacific, Comcast suffered a nationwide outage, sending several million users of the popular Internet Service Provider into a frenzy. Our analysis has revealed that Level 3 leaked over a thousand routes belonging to Comcast subsidiary networks and their customers. The immediate effect was that instead of traffic going through the Comcast backbone to reach these networks, traffic was forced to go via Level 3. The result was a significant spike in packet loss and latency, causing many Internet services and applications like Slack and Webex to become functionally unusable. In today’s era of interconnected networks and Internet-reliant businesses, yesterday’s outage is yet another reminder of the complex dependencies and vulnerability of the Internet.
Observing The Impact of the Outage
On Monday morning at 9:45am, we noticed that ThousandEyes employees connecting remotely through the Comcast network were unable to use productivity and collaboration tools like Slack, Gmail and Webex. At the time of the outage, we noticed multiple instances of packet loss within the Comcast network distributed across various regions. The impact of the outage was widespread across North America, equally impacting the east and west coasts. Endpoint Agents on the laptops of remote ThousandEyes employees in Chicago and Miami, monitoring services hosted outside of the Comcast network, consistently showed packet drops within the Comcast infrastructure in New York and Atlanta, respectively, as shown in Figures 1 and 2.
We also observed packet loss for a service hosted within the Comcast infrastructure. Figure 3 below shows packet loss within the Seattle/Tacoma Comcast network along with a spike in packet loss within the Level 3 network.
Initial Signs of Level 3 Muddying the Waters
Beyond the packet loss within Comcast, the more interesting data point from Figure 3 is that we are now starting to see packet loss within the Level 3 (AS 3356) network as well. Level 3 later confirmed that a misconfiguration was the root cause of the Comcast outage, while Arbor Networks commented that the misconfiguration was related to a BGP route leak. Our analysis also pointed in the direction of a route leak by Level 3 that adversely impacted traffic to and from the Comcast network.
Beginning at 9:30am Pacific, tests targeting services within the Comcast network started witnessing inconsistencies in BGP routing that manifested in the IP forwarding layer through Path Visualization. An Agent-to-Agent test targeting an Enterprise Agent hosted within the Comcast network began observing elevated packet loss during the time of the outage, as shown in Figure 4. To see the interactive data, feel free to explore this share link.
Right before the outage, as seen in Figure 5, we noticed that traffic from the Canada and Seattle Cloud Agents, destined for the Enterprise Agent located in the Comcast network (AS 33650), traversed the Comcast backbone AS 7922.
However, at the time of the outage, traffic from these agents located in Canada and Seattle was rerouted via Level 3 and subjected to increased latency and a spike in packet loss, as shown in Figure 6.
Level 3 Leaks a Bunch of Routes and Modifies AS Paths
So why did traffic get rerouted via Level 3? Beginning at 9:30am Pacific, Level 3 (AS 3356) leaked more specific routes to the Comcast ASN 33650, which forced traffic through Level 3. Route leaks involve the illegitimate advertisement of prefixes (blocks of IP addresses) that then propagate across networks. However, unlike BGP hijacks, BGP route leaks are not always malicious; they are usually inadvertent and caused by misconfigurations. Route leaks are especially prone to propagation when a more specific prefix is advertised.
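To make the idea of spotting such a leak concrete, here is a minimal sketch of one possible heuristic: flag any announcement that is more specific than a known baseline prefix and whose AS path does not traverse the expected transit AS. The function name `find_suspect_announcements`, the `10.0.0.0/8` address space, and the vantage-point AS 64500 are illustrative assumptions, not the routes from this outage and not how any particular monitoring product works.

```python
import ipaddress

def find_suspect_announcements(baseline, announcements, expected_transit):
    """Return announcements that are strictly more specific than a baseline
    prefix and whose AS path does not include the expected transit AS.

    baseline: list of prefix strings normally announced (hypothetical data)
    announcements: list of (prefix string, AS path as list of ints)
    expected_transit: AS number that paths are normally expected to traverse
    """
    base_nets = [ipaddress.ip_network(p) for p in baseline]
    suspects = []
    for prefix, as_path in announcements:
        net = ipaddress.ip_network(prefix)
        # Strictly more specific: inside a baseline prefix but not equal to it.
        covered = any(
            net.version == b.version and net != b and net.subnet_of(b)
            for b in base_nets
        )
        if covered and expected_transit not in as_path:
            suspects.append((prefix, as_path))
    return suspects

# Illustrative data only: private address space and a documentation ASN (64500).
baseline = ["10.0.0.0/8"]
announcements = [
    ("10.20.0.0/16", [64500, 3356, 33650]),  # more specific, bypasses AS 7922
    ("10.30.0.0/16", [64500, 7922, 33650]),  # more specific, expected transit
    ("10.0.0.0/8", [64500, 7922, 33650]),    # the baseline prefix itself
]
suspects = find_suspect_announcements(baseline, announcements, expected_transit=7922)
print(suspects)
```

A heuristic like this would also flag legitimate traffic engineering, so in practice it would be a starting point for investigation rather than proof of a leak.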
In this particular case, Level 3 leaked more than a thousand specific prefixes of Comcast subsidiary networks and their customers. Those 1000+ prefixes are typically routed via the Comcast backbone AS 7922. When Level 3 declared to the Internet that it was the best way (AS Path) to get to these prefixes, traffic for all of these networks started going through Level 3 instead of the Comcast backbone. Packet loss and increased latency ensued. In one of the many examples of route leaks we detected, as shown in Figure 7, Level 3 inserted itself into the AS path by advertising a more specific prefix 220.127.116.11/16 (versus the less specific 18.104.22.168/8 that is advertised by the Comcast backbone). When the neighboring ASes saw a longer matching prefix, the laws of BGP dictated that they pick the path via Level 3 to reach Comcast, instead of routing through AS 7922.
At 11:25am Pacific, Level 3 withdrew these leaked routes, re-establishing the peace and quiet of the Internet. People across the land rejoiced greatly to get their apps and services back, but many organizations were likely left with questions about how the Internet supports their business.
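The longest-match behavior described above can be sketched in a few lines of Python. This is a simplified illustration of how a router picks the most specific matching prefix, not a model of full BGP best-path selection. The routing table, the `10.0.0.0/8` and `10.20.0.0/16` prefixes, and the `best_route` helper are assumptions made up for the example; they are not the prefixes involved in the outage.

```python
import ipaddress

# Hypothetical routing table: prefixes and next-hop labels are illustrative.
ROUTES = {
    "10.0.0.0/8": "via AS 7922 (Comcast backbone)",
    "10.20.0.0/16": "via AS 3356 (Level 3, leaked more specific)",
}

def best_route(destination, routes):
    """Return (prefix, next_hop) for the longest prefix containing destination,
    or None if no prefix matches."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, hop))
    if not matches:
        return None
    # The most specific route is the one with the largest prefix length.
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), hop

# Inside the leaked /16: the more specific route wins.
print(best_route("10.20.1.1", ROUTES))
# Outside the leaked /16: only the /8 matches, so traffic uses the backbone.
print(best_route("10.99.1.1", ROUTES))
```

Because the more specific route wins regardless of how attractive the less specific path is, a leaked more-specific prefix pulls traffic toward the leaking network for as long as the announcement stays up.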
Stay Alerted on Route Leaks and Outages
The fact is that BGP route leaks and hijacks are becoming increasingly common in an Internet-centric environment. For more on route leaks, check out our previous posts: Level 3 Outage, Amazon AWS Route Leak. Yesterday’s outage was yet another reminder of the vulnerability of the Internet and the far-reaching impact on services and businesses that rely on it. If you’re interested in monitoring routes to your network, detecting issues in upstream ISPs or cloud providers and setting up precision-guided alerts, sign up for a free 15-day trial and try it for yourself.