Over the last several weeks, some of the most prominent digital companies, including Google, Cloudflare, Amazon and, most recently, Apple, experienced issues with the services they offer. While the types of services these companies offer differ, the common thread between the incidents was that they were a direct result of problems with the Border Gateway Protocol (BGP), the protocol that, more than any other technology, makes the Internet a reality. The other commonality, of course, was that the incidents were quite costly for the affected companies and their users.
BGP events such as these are meticulously investigated and reported, at least internally by each organization and in some cases quite publicly. However, in the aftermath of all the analysis and hand-wringing about the vulnerable state of the Internet, not much ever seems to happen in the big picture to prevent routing problems from recurring. That is the situation we find ourselves in, decades after BGP’s inception.
Now, it’s not that there are no norms or built-in mechanisms for getting BGP right on the Internet. Over the years, methods such as maximum prefix limits, Internet Routing Registry (IRR) based filtering and Resource Public Key Infrastructure (RPKI) have been defined and implemented. For more information on some of these methods, check out our earlier post on Best Practices to Combat Route Leaks and Hijacks.
Yet all of these best-practice methods suffer from the same fundamental limitation: there’s no way to make them binding on all the networks that make up the Internet. The only way that best practices spread on the Internet is through social promotion and business pressure.
To that end, RIPE held an RPKI deployathon in March, a much-needed event that gave hands-on experience with RPKI technology to those who need it the most: network engineers and operators. RPKI proponents have been actively raising awareness. In fact, if one positive thing emerged from the recent outages, it was that BGP protection mechanisms, and RPKI in particular, got some real exposure.
Visualizing the Benefits of RPKI
The benefits of RPKI were easy to observe during the outage that affected Cloudflare users. Cloudflare network engineering publicly shared their traffic utilization (NetFlow) graphs and called on others to share theirs. Figures 1 and 2 show Cloudflare traffic transiting Verizon and AT&T, respectively.
Figure 1, which shows Cloudflare's traffic transiting Verizon, indicates a significant drop caused by heavy congestion in the Verizon network as a result of accepting leaked routes. By contrast, Figure 2, which shows Cloudflare traffic transiting AT&T, tells a different story: no impact whatsoever.
The reason for the difference is that, prior to this incident, AT&T had moved forward with its RPKI adoption and implemented a policy of rejecting RPKI-invalid BGP announcements.
A Small Scale Illustration of RPKI in Action
ThousandEyes provides multi-layer visibility from the app experience down to Internet routing. We have dozens of BGP routing collectors from which we receive BGP updates so we can track prefix reachability and path changes over time. Recently, we were notified of a partial loss of reachability for a customer prefix. We looked at our BGP visualization view, seen in Figure 4, where dotted red lines show where previously working paths became invalid. We noted that the change occurred three AS hops away from the originator. This told us that the change wasn’t due to traffic engineering by the originating AS toward its direct upstream ISPs, for example via RFC 1998 communities.
Moreover, the prefix was unreachable only from the New York and Amsterdam BGP collectors; all other BGP collectors could reach it. What could explain this anomalous behavior? Only two providers, KPN (AS 286) and AT&T (AS 7018), withdrew the prefix, whereas other upstream Tier-1 providers such as Level3 and Telstra did not. As it turns out, both AT&T and KPN reject RPKI-invalid prefixes: AT&T's policy is noted above, and KPN calls out RPKI support on its routing policy page.
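The reasoning above can be sketched in a few lines of Python. This is only an illustration, not how the ThousandEyes platform works: the AS paths are hypothetical (the real paths are not reproduced here), the ASNs other than AS 286 and AS 7018 come from the range reserved for documentation, and the helper names suspect_ases and hops_from_origin are made up for this example. It also assumes no AS-path prepending.

```python
# Hypothetical pre-change AS paths per collector, ordered from the
# collector's neighbor to the origin AS (the last element).
paths_before = {
    "New York":  [7018, 64500, 64501, 64496],
    "Amsterdam": [286, 64500, 64501, 64496],
    "London":    [64502, 64500, 64501, 64496],
    "Sydney":    [64503, 64501, 64496],
}
# Collectors that lost reachability to the prefix.
affected = {"New York", "Amsterdam"}


def suspect_ases(paths, affected_collectors):
    """Return ASes seen only on paths from collectors that lost the prefix."""
    on_affected, on_healthy = set(), set()
    for collector, path in paths.items():
        target = on_affected if collector in affected_collectors else on_healthy
        target.update(path)
    return on_affected - on_healthy


def hops_from_origin(path, asn):
    """Count AS hops between asn and the origin (assumes no prepending)."""
    return len(path) - 1 - path.index(asn)


suspects = suspect_ases(paths_before, affected)
print(sorted(suspects))
print(hops_from_origin(paths_before["New York"], 7018))
```

ASes shared with healthy collectors drop out of the result, which leaves only the networks that are candidates for having withdrawn the route.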
Rejecting RPKI-invalid announcements means that if the ISP receives a routing announcement that doesn’t match the Route Origin Authorization (ROA), the signed record that maps a prefix to the ASN authorized to originate it, then the ISP will not accept, use or propagate the announcement. For example, if a rogue AS sends an announcement claiming to originate a prefix that doesn’t belong to it, and a ROA exists for that prefix, then an ISP that rejects RPKI-invalids will refuse the announcement.
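As a rough sketch of the decision an RPKI-validating router makes, the Python below classifies an announcement as valid, invalid or not-found, following the general origin validation logic described in RFC 6811. It is a simplification under stated assumptions: the Roa class and the function names are invented for this example, the prefixes and ASNs come from ranges reserved for documentation, and real deployments obtain cryptographically validated ROA data from an RPKI validator rather than a hard-coded list.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Sequence


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified view of one validated ROA entry."""
    prefix: str      # prefix the ROA covers
    max_length: int  # longest prefix length the origin may announce
    asn: int         # AS authorized to originate the prefix


def validate_origin(prefix: str, origin_asn: int, roas: Sequence[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"


def should_accept(prefix: str, origin_asn: int, roas: Sequence[Roa]) -> bool:
    """Policy sketch: drop only invalids; keep valid and not-found routes."""
    return validate_origin(prefix, origin_asn, roas) != "invalid"


roas = [Roa(prefix="198.51.100.0/24", max_length=24, asn=64496)]

print(validate_origin("198.51.100.0/24", 64496, roas))  # authorized origin
print(validate_origin("198.51.100.0/24", 64511, roas))  # wrong origin AS
print(validate_origin("198.51.100.0/25", 64496, roas))  # longer than max_length
print(validate_origin("203.0.113.0/24", 64496, roas))   # no covering ROA
```

Note the third case: even the authorized origin is classified as invalid when it announces a prefix more specific than the ROA's maximum length, so a mistake in a ROA can cause an operator's own legitimate announcement to be rejected.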
In this case, there wasn’t anything quite so sinister at work. We got in touch with the AS that originated the prefix and found out that they had unfortunately made a configuration error in their ROA, such that their own announcement didn’t comply with the ROA and was therefore rejected as invalid by both AT&T and KPN. Once the error was fixed and the ROA had propagated, prefix reachability via AT&T and KPN returned to normal, as seen in Figure 5.
Even though this was a very small scale and inadvertent event, it showcases how effective RPKI-based route filtering is.
How to Help Internet Routing Hygiene
Wide-scale adoption of RPKI will go a long way toward cleaning up Internet routing and making it more secure. How can you help? If you’re a provider, implement strict filtering based on RPKI. If you’re an enterprise, put strict RPKI-based filtering of routing announcements down as a requirement in your RFIs or RFPs for ISP services. The more market pressure ISPs receive, the more they’ll be motivated to adopt best practices that benefit everyone.