A few months back, on March 18th, some users connecting to Fortnite: Battle Royale would have experienced first a total outage, then significant performance degradation, over the course of about thirty minutes. The outage wasn’t caused by an application issue, or even by network performance. In this case it was caused by instability of advertised Internet routes, also known as BGP route flapping. It’s as if the entire last mile of highway (from every direction) to a football stadium simply vanished, leaving fans completely cut off.
Fortnite is a third-person shooter game that revolves around a standard battle royale format: a hundred players enter, last player standing wins. Players are dropped on an island and have to battle it out with all other players as the available battleground slowly shrinks. Fortnite has become something of a phenomenon, with over 140 million users worldwide and over 92 million events processed every minute. The game has a free-to-play model, with much of the revenue for the platform coming from in-game purchases, such as custom skins and persona modifications. Game availability and performance are obviously critical to user experience and player adoption, but given the game’s model (the more users play, the more opportunity there is for them to make add-on purchases), performance is even more directly tied to revenue.
Fortnite’s creator, Epic Games, has had its fair share of outages due to everything from unscheduled maintenance windows to application scalability, all of which have become increasingly high profile as the game has risen in popularity. The company has detailed some of these outages, including a lengthy post-mortem published after the launch of version 3.5 in April; however, there are others that haven’t received as much attention.
The March outage, which was captured using the ThousandEyes application, wasn’t application-related or lengthy, but it does illustrate the many dependencies involved in delivering a gaming service via the Internet. When the outage began at around 6pm PST, users connecting to the Fortnite instance qos1.ol.epicgames.com via hosting provider Seflow would have experienced a total disruption in gameplay. The Path Visualization view below shows massive packet loss occurring at every edge node connected to this hosting provider. This massive loss would have completely prevented some users from connecting to the game.
This outage wasn’t caused by a collective failure of peering ISPs. Instead, routing updates originating from AS49367, the hosting provider, were disrupting the Internet’s ability to reach it. The continuous withdrawal and re-announcement of BGP route advertisements, also known as flapping, would have prevented users from connecting to game instances hosted out of that particular data center.
What’s interesting about this particular outage is that it’s unrelated to any architectural decisions Epic Games could have made. Epic Games has a multi-cloud strategy, with instances hosted out of AWS, Google Cloud, and potentially other providers. They’ve also taken steps recently to improve their ability to scale. But in this case, there was nothing wrong with Fortnite’s servers. There they were, sitting ready in the data center, waiting for users to invoke processes for gameplay. Instead, this was about the “Internet map” users relied on to get to them.
BGP is the routing protocol used by autonomous systems (ASes) to determine paths through the Internet. Each AS announces to the Internet a list of address prefixes it can get to, and path-vector algorithms determine the best way to reach any given destination. Routes are exchanged amongst peers, and routing updates received will be added to each AS’s routing table—which is effectively a list of known routes.
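To make the idea of a routing table concrete, here is a minimal sketch of how announcements and withdrawals from peers might update a table, and how a best path could be chosen from it. It is a deliberate simplification: real BGP best-path selection weighs several attributes (local preference, origin, MED and others), while this sketch assumes AS-path length is the only criterion. The `RoutingTable` class, the 203.0.113.0/24 prefix and the peer ASNs 64500 to 64502 are illustrative (documentation ranges), not taken from the incident; AS49367 is the origin named above.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Toy routing table: prefix -> {peer ASN: AS path learned from that peer}."""
    routes: dict = field(default_factory=dict)

    def announce(self, prefix, peer, as_path):
        # A new announcement from a peer replaces that peer's previous path.
        self.routes.setdefault(prefix, {})[peer] = list(as_path)

    def withdraw(self, prefix, peer):
        # Remove the path learned from this peer; drop the prefix if none remain.
        peers = self.routes.get(prefix, {})
        peers.pop(peer, None)
        if not peers:
            self.routes.pop(prefix, None)

    def best_path(self, prefix):
        # Simplifying assumption: the shortest AS path wins,
        # with the lowest peer ASN as a deterministic tie-break.
        candidates = self.routes.get(prefix)
        if not candidates:
            return None  # prefix is unreachable
        peer = min(candidates, key=lambda p: (len(candidates[p]), p))
        return candidates[peer]


table = RoutingTable()
table.announce("203.0.113.0/24", 64500, [64500, 49367])
table.announce("203.0.113.0/24", 64501, [64501, 64502, 49367])
print(table.best_path("203.0.113.0/24"))  # shorter path, learned via 64500

table.withdraw("203.0.113.0/24", 64500)
print(table.best_path("203.0.113.0/24"))  # falls back to the longer path

table.withdraw("203.0.113.0/24", 64501)
print(table.best_path("203.0.113.0/24"))  # no routes left
```

When the last route for a prefix is withdrawn, the lookup returns nothing at all, which is the state the affected players were in while the routes flapped.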
Internet route advertising is notoriously tenuous, built effectively on a chain of trust that spans every ISP on the planet, regardless of reputation. For instance, in 2017 there were nearly 14,000 BGP routing incidents. Not all of these were malicious. Many were due to configuration errors or outages. Regardless of cause, routing issues have led to some high-visibility events, including a route leak by Google that effectively shut down the Internet for Japan in August of last year. Any disruption in routing, whether due to a hijack, leak or flapping, can have considerable impact. Route flapping in particular can have a ripple effect on performance because the continuous updating of route advertisements can overtax router CPUs and lead to a state of non-convergence.
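One mechanism routers can use to limit that ripple effect is route flap damping, described in RFC 2439: each flap adds a penalty to the route, the penalty decays exponentially over time, and the route is suppressed while the penalty stays above a threshold. The sketch below illustrates the idea only. The `FlapDamper` class is hypothetical, the numeric defaults are illustrative rather than taken from any particular router, and nothing here implies that damping was or wasn't in use during this incident.

```python
class FlapDamper:
    """Toy penalty-based flap damping for a single route (times in seconds)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life_s=900.0):
        # All defaults are illustrative assumptions.
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay: the penalty halves every half_life_s seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def record_flap(self, now):
        # Each withdrawal/re-announcement cycle adds a fixed penalty.
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def is_suppressed(self, now):
        # A suppressed route is reused once its penalty decays below reuse_limit.
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False
        return self.suppressed


damper = FlapDamper()
for t in (0, 10, 20):          # three flaps within twenty seconds
    damper.record_flap(t)
print(damper.is_suppressed(20))    # suppressed right after the burst
print(damper.is_suppressed(3620))  # an hour later the penalty has decayed
```

The trade-off is that a damped route stays unreachable for some time after it has stabilized, which is one reason operators tend to tune these thresholds conservatively.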
Luckily, in this case the problem was quickly addressed, with the routes stabilizing within thirty minutes, which restored normal reachability and enabled traffic to flow without issue.
Most game users wouldn’t have known that their inability to reach their gaming instance that day in March was likely due to a faulty router, which led to rapid route oscillation, which in turn left neighboring ISPs unable to keep up with the continuous changes. The complexity of application delivery today means that many dependencies must all work flawlessly together to ensure a good user experience. Key among them is BGP routing, which ensures reachability across the Internet. Fortnite, sitting isolated and alone on its data center island, was effectively the last one standing. In this case, though, no one was the winner.