If you’ve spent most of your time dealing with enterprise or carrier networks, one of the most surprising things about the Internet is its sheer unpredictability. Yesterday from 10:50am to 11:30am BST and from 1:10pm to 2:13pm BST, ThousandEyes detected a service disruption to WhatsApp for a limited set of users around the globe. But this issue had nothing to do with WhatsApp’s service itself, and everything to do with how easily Internet glitches can dramatically affect service availability and delivery.
WhatsApp was one of many services impacted by a major BGP route leak from Swiss colocation provider Safe Host. During the two instances of the outage, Safe Host leaked thousands of prefixes. The leak had a cascading effect on the availability of those services once the routes were accepted and propagated by service providers such as China Telecom, and then further accepted by other ISPs such as Cogent.
Unlike BGP hijacks, BGP route leaks are often benign from the perspective of service disruption, except where the route change steers traffic to an ISP or a destination that will blackhole it. In this particular instance, the impact of the route leak was elevated because the incorrect routes were accepted and propagated by China Telecom, an ISP known to have aggressive filtering policies. The route leak also manifested as packet loss and outages in the peerings between China Telecom and providers such as Cogent, as shown in Figure 2 below.
First Signs of Disruption
At 10:50am BST, ThousandEyes vantage points detected a dip in availability of WhatsApp services from locations in Dublin, Ireland and Las Vegas, Nevada, as shown in Figure 1. While the blast radius of this service disruption seemed contained, Connect errors pointed to a network connectivity problem. Throughout both instances of the outage, the symptoms stayed the same: connectivity issues caused by nearly 100% packet loss for traffic trying to reach the WhatsApp service.
We were able to confirm that the service disruption was indeed a byproduct of packet drops in the network layer by looking at Path Visualization, which illustrates the hop-by-hop path traversed from ThousandEyes vantage points to WhatsApp’s service hosted in IBM SoftLayer. Cogent’s routers in London appeared to be dropping packets as they were the last hops in the Internet path that handed traffic off to China Telecom (Figure 2), an ISP that did not exist in the end-to-end network path before the disruption (Figure 3).
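The reasoning in this step, finding the last hop that still answers before the path goes dark, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Path Visualization is implemented: the `Hop` record, the `last_responding_hop` helper, and the sample trace (built from documentation-range addresses) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hop:
    """One hop of a traceroute-style path trace (hypothetical record type)."""
    ttl: int
    address: Optional[str]   # None when the hop did not answer the probe
    network: Optional[str]   # owning network, if known


def last_responding_hop(hops: Sequence[Hop]) -> Optional[Hop]:
    """Return the responding hop with the highest TTL, or None if none replied."""
    responding = [hop for hop in hops if hop.address is not None]
    return max(responding, key=lambda hop: hop.ttl, default=None)


# Invented sample trace: replies stop after the third hop.
trace = [
    Hop(1, "192.0.2.1", "Vantage point ISP"),
    Hop(2, "198.51.100.7", "Cogent"),
    Hop(3, "198.51.100.9", "Cogent"),
    Hop(4, None, None),
    Hop(5, None, None),
]

last = last_responding_hop(trace)
print(last.ttl, last.network)
```

A silent hop is not proof of a drop at that hop, since routers may simply decline to answer probes. The last responding hop marks where visibility ends, which is why the loss first appeared to sit with Cogent even though the traffic was being dropped after the handoff.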
Upon further analysis, we were able to trace the root cause of the disruption to China Telecom dropping packets (Figure 4) as a result of a BGP route leak.
Signatures of a BGP Route Leak
The appearance of an unexpected ISP in the path raised suspicion of a BGP routing glitch. BGP route leaks involve the incorrect advertisement of prefixes, or blocks of IP addresses, which propagate across networks and lead to incorrect or suboptimal routing. A route leak can occur when an Autonomous System (AS) originates a prefix that it does not actually own, or when an AS announces that it can deliver traffic through a route that should not exist. If you are interested in understanding how a BGP route leak occurs, read more here. Back to the incident at hand, what exactly happened?
Service disruption to WhatsApp was triggered when a Swiss colocation company called Safe Host announced to the Internet that the best way to reach WhatsApp and thousands of IP prefixes was through its network, AS 21217. When Safe Host advertised these routes, they were accepted by China Telecom and further propagated to other ISPs, including Cogent in the case of WhatsApp. As a result, when traffic destined for WhatsApp reached Cogent, the newly accepted routes sent it on to China Telecom. Once traffic entered the China Telecom backbone, we saw significant packet loss, possibly due to aggressive filtering policies of the Great Firewall.
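One common way to reason about a path that "should not exist" is the valley-free rule: a route may travel from customer to provider any number of times, cross at most one peering link, and then only travel from provider to customer. A network that re-announces a route learned from one provider or peer to another provider creates a "valley" in the path. The sketch below checks that rule. It is an illustration only: the `is_valley_free` helper is hypothetical, the ASNs come from the range reserved for documentation, and the business relationships are invented rather than taken from this incident.

```python
def is_valley_free(as_path, customer_of, peers):
    """Check an AS path against the valley-free rule.

    as_path     -- ASNs as seen in BGP: nearest AS first, origin AS last
    customer_of -- set of (customer, provider) tuples
    peers       -- set of frozenset({a, b}) peering pairs
    Note: this sketch does not handle AS path prepending (repeated ASNs).
    """
    hops = list(reversed(as_path))  # order in which the route propagated
    descending = False              # True once the route crossed a peer or went downhill
    for sender, receiver in zip(hops, hops[1:]):
        if (sender, receiver) in customer_of:
            # Customer announcing to its provider: only valid while still climbing.
            if descending:
                return False
        elif frozenset((sender, receiver)) in peers:
            # At most one peering link, and only at the top of the path.
            if descending:
                return False
            descending = True
        elif (receiver, sender) in customer_of:
            # Provider announcing to its customer: always allowed.
            descending = True
        else:
            raise ValueError(f"no known relationship between AS{sender} and AS{receiver}")
    return True


# Invented relationships among documentation ASNs.
CUSTOMER_OF = {(64496, 64497), (64499, 64497), (64499, 64500)}
PEERS = {frozenset((64497, 64498))}

normal_path = [64498, 64497, 64496]          # origin -> provider -> peer
leaked_path = [64500, 64499, 64497, 64496]   # AS64499 re-announces a provider route upstream

print(is_valley_free(normal_path, CUSTOMER_OF, PEERS))
print(is_valley_free(leaked_path, CUSTOMER_OF, PEERS))
```

In practice the relationship data is inferred and incomplete, so a check like this flags candidates for investigation rather than confirming a leak.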
ThousandEyes BGP Route Visualization (Figure 6) also picked up on the BGP route leak when we noticed a new AS path via China Telecom and Safe Host to a few of the affected prefixes.
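The signal described here, networks appearing in an AS path that were never there before, is simple to express in code. The following is a minimal sketch, not the logic behind BGP Route Visualization: `unexpected_asns` is a hypothetical helper, the paths are invented for illustration, and only AS 21217 comes from this incident. AS 174 and AS 4134 stand in for Cogent and China Telecom, and the remaining ASNs are documentation placeholders.

```python
def unexpected_asns(baseline_paths, observed_path):
    """Return ASNs in observed_path that appear in none of the baseline paths,
    preserving the order in which they occur."""
    known = {asn for path in baseline_paths for asn in path}
    return [asn for asn in observed_path if asn not in known]


# Invented paths toward a placeholder origin (AS64496); nearest AS first.
baseline = [
    [174, 64497, 64496],
]
observed = [174, 4134, 21217, 64497, 64496]

print(unexpected_asns(baseline, observed))
```

A new ASN is not always a leak, since providers change upstreams legitimately, so a check like this works best as a trigger for a closer look at the affected prefixes.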
The Takeaway
BGP route leaks are not uncommon on the Internet. However, the business risks associated with such route leaks and other Internet flaws are greater given the modern enterprise and service delivery landscape. When you rely on the Internet, an ecosystem that is vulnerable and deeply interconnected, a glitch in one part of the infrastructure can have cascading effects on another.
The key takeaway here is that in a cloud-centric world, enterprises must have visibility into the Internet if they’re going to be successful in delivering services to their users. Most enterprise IT teams are still not aware of how different the Internet is, as an infrastructure, from carrier and enterprise networks, and are unprepared for such an unpredictable environment. This incident shows how ridiculously easy it is for a simple error to dramatically alter the service delivery landscape on the Internet. If you can’t see what’s happening, you can’t hold providers accountable and solve problems.