On July 6, 2016, Pokémon GO, the augmented reality gaming phenomenon, launched in the United States, and millions became Pokémon trainers with the dream to “catch 'em all.” By July 7, Pokémon GO was the most downloaded and top-grossing iPhone game in the United States. At the time of this writing, the GPS-based game had generated $35 million in revenue from 30 million downloads.
The journey to the top was not an easy one, as the app encountered multiple outages as it tried to keep up with the multiplying user interest. Pokémon GO’s stumbles highlight the many obstacles that a newly popular service can face, from network and server overload to application-level issues. In this post, we’ll dive into the data from two of Pokémon GO’s most recent and severe outages, which you’ll see are very different in nature.
Overload Issues at Pokémon GO
On the morning of July 16, Pokémon GO experienced one of its first severe outages, which lasted for about four hours and was likely a result of both overloaded servers and network infrastructure. During the outage, Pokémon GO players began seeing an “Unable to connect to server” error message when they opened the application.
Just hours before the errors appeared, Pokémon GO was released in 26 European countries, significantly expanding the game’s user base. At the same time, a hacking organization called PoodleCorp claimed responsibility for a distributed denial-of-service (DDoS) attack on Pokémon GO servers. Whether the cause was a large-scale DDoS attack, the sudden expansion of the user base, or a combination of both, the resulting traffic likely overloaded Pokémon GO’s servers and network infrastructure, as we’ll see below.
To set up a test against Pokémon GO’s application, we used the Charles proxy application as a man-in-the-middle (MITM) proxy to intercept the app’s HTTPS traffic and determine the primary endpoint URL that the application connects to. We then set up an HTTP Server Test in ThousandEyes targeting https://pgorelease.nianticlabs.com/plfe, with network measurements enabled. This domain name resolves to a unicast IP address within Google Cloud’s infrastructure (Google is also an investor in Niantic Labs, the organization behind Pokémon GO). To follow along with our outage analysis, visit this share link.
If we first look at the data from before the outage occurred, the Table tab in the Network End-to-End Metrics view provides some basic information on the server infrastructure. Latency, which is measured from Cloud Agents located around the world, increases with geographical distance, suggesting that Pokémon GO’s server infrastructure is centralized in a single location in the U.S.; anycast IP addressing is not in play here.
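The latency-versus-distance reasoning can be sketched in a few lines of Python. This is a minimal illustration, not how ThousandEyes computes anything: the agent coordinates are approximate, the latency values are invented for the example, and the assumed server location is an arbitrary point in the central U.S. rather than anything we measured.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# ASSUMPTION: arbitrary placeholder for a single server site in the central U.S.
ASSUMED_SERVER = (39.0, -98.0)

# ASSUMPTION: hypothetical agents as (name, lat, lon, latency in ms); values are invented.
AGENTS = [
    ("Chicago", 41.88, -87.63, 25.0),
    ("London", 51.51, -0.13, 110.0),
    ("Sao Paulo", -23.55, -46.63, 150.0),
    ("Tokyo", 35.68, 139.69, 140.0),
    ("Sydney", -33.87, 151.21, 190.0),
]

def latency_distance_correlation(agents, server):
    """Correlate each agent's latency with its distance to the assumed server."""
    distances = [haversine_km(lat, lon, server[0], server[1]) for _, lat, lon, _ in agents]
    latencies = [latency for _, _, _, latency in agents]
    return pearson(distances, latencies)

correlation = latency_distance_correlation(AGENTS, ASSUMED_SERVER)
```

A correlation close to 1 is what you would expect from a single unicast location. With anycast, distant agents would typically be served from a nearby site, so the relationship between distance and latency would weaken.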
The Path Visualization shows that once packets enter the Google network (blue-green nodes below), they are routed via MPLS (dotted blue links) to the target with a unicast IP address.
Starting at 5:25am Pacific, ThousandEyes Cloud Agents began reporting elevated levels of packet loss, peaking at 100%. The outage lasted for over 4 hours, as attempts by Cloud Agents to connect to Pokémon GO servers consistently experienced connection failures within Google’s infrastructure.
Because average packet loss was not consistently 100%, the servers were not completely offline during most of the outage. However, the few lucky Pokémon trainers who were able to log in reported that the application often froze, making the game virtually unplayable.
While packet loss was severe, the availability of the Pokémon GO app also dipped to 0% for much of the outage.
The Path Visualization gives a detailed picture of the underlying network issues. Packets from Cloud Agents in different regions enter Google’s infrastructure at different entry points. However, once within the Google network, packets are routed to a single IP address. Under high traffic loads, it’s likely that this network architecture created a bottleneck condition, or target servers were no longer able to handle the load effectively, or a combination of both. As a result, though the hops before the target were able to forward packets on, at interfaces closer to the target, either these packets were dropped or their responses were lost, resulting in high forwarding loss at the red nodes below.
So what actually failed during this outage? Based on what we saw in the Path Visualization, it’s likely that the target servers were overloaded, causing backup on the links and interfaces leading to those servers. It’s also possible that the network infrastructure itself became congested under heavy traffic loads, or that both factors were at play.
Application Failures at Pokémon GO
On July 20, Pokémon trainers around the world catching 400 Magikarp candy were interrupted by yet another worldwide outage that lasted for about 5 hours and 30 minutes, this time likely caused by application issues. To see the data from this outage, view this share link.
According to Pokémon GO, the worldwide outage was caused by the release of their version 1.0.3 update. The release was expected to fix various bugs and issues within the game. On the user side, Pokémon trainers began seeing errors displaying the text “Failed to get player information from the server” or incomplete game content.
Digging into the data, we first see that the service had significant availability issues, with lows at 0% availability.
To check for network issues, we looked at the Path Visualization view. Though there are a few packet loss spikes, the overall loss and latency measurements were within expected ranges, absolving the network from any blame for the outage.
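The triage logic we applied to both outages can be written down as a small helper. This is only a sketch of the reasoning: the function name, the labels, and both thresholds are assumptions chosen for illustration, not values used by ThousandEyes.

```python
def classify_outage(avg_loss_pct, availability_pct,
                    loss_threshold_pct=5.0, availability_threshold_pct=90.0):
    """Rough first-pass triage from two metrics.

    ASSUMPTION: both thresholds are hypothetical defaults; tune them to
    whatever baseline your own measurements establish.
    """
    if availability_pct >= availability_threshold_pct:
        return "healthy"
    if avg_loss_pct >= loss_threshold_pct:
        # Low availability together with heavy loss: look at the network
        # path and at overloaded servers first.
        return "network or overload suspected"
    # Low availability while the network looks normal: look at the application.
    return "application suspected"
```

Under these assumed thresholds, a round with heavy packet loss and 0% availability (as on July 16) lands in the network or overload bucket, while a round with normal loss and 0% availability (as on July 20) points at the application.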
Coming back to the application layer, we see that at the start of the outage at 12:15pm Pacific, many Cloud Agents reported HTTP Receive errors. “Receive time” is the time it takes to receive the response once the HTTP request has been sent (measured from the first byte to the last byte of the payload). The Error Details in the Table tab indicate that the agents didn’t receive a response from the target within the 5-second timeout.
As we move across the timeline, more and more agents started reporting HTTP Receive errors over multiple consecutive rounds of measurement, while packet loss and latency measurements remained within expected values. From the HTTP Server view, we can see that the agents were able to resolve the domain pgorelease.nianticlabs.com to an IP address, establish a TCP connection with the target and send a GET Request. All errors occur in the Receive phase, pointing to the likelihood that the target failed to respond to the GET requests.
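To make the phases concrete, here is a minimal Python sketch that walks through the same sequence (DNS, connect, send, receive) and reports the first phase that fails. It is an illustration rather than the ThousandEyes implementation: the function name and return labels are our own, and it speaks plain HTTP, so the TLS handshake that an HTTPS target requires is deliberately left out.

```python
import socket

def probe_http_phases(host, port, path="/", timeout=5.0):
    """Return the first failing phase ("dns", "connect", "send", "receive") or "ok".

    ASSUMPTION: plain HTTP only; an HTTPS target would need an extra TLS phase.
    """
    # Phase 1: resolve the name to an address.
    try:
        address = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    except socket.gaierror:
        return "dns"

    # Phase 2: establish the TCP connection.
    try:
        sock = socket.create_connection(address[:2], timeout=timeout)
    except OSError:
        return "connect"

    with sock:
        # Phase 3: send a GET request.
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        try:
            sock.sendall(request.encode("ascii"))
        except OSError:
            return "send"

        # Phase 4: wait for the first bytes of the response.
        try:
            data = sock.recv(4096)
        except OSError:  # includes the timeout raised when nothing arrives in time
            return "receive"

        # An empty read means the peer closed the connection without answering.
        return "ok" if data else "receive"
```

A server that accepts connections but never answers, which is what the agents appear to have encountered here, makes this function return "receive" once the timeout expires.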
Another indication that the outage was entirely a server-side issue is seen in the response HTTP headers before and after the outage. Before the outage, the Last-Modified field was dated Tue, 19 July, 19:08:08 GMT. As per RFC 2616, the Last-Modified field indicates the date and time at which the origin server believes the retrieved page was last modified.
Then, around 12:40pm Pacific, two Cloud Agents, Sydney and Melbourne, first reported HTTP 500 Errors, and a number of other agents followed suit in later test rounds. Most of the remaining agents continued to observe HTTP Receive errors.
Around 1:10pm Pacific, New Delhi, India was the first to report a change in the Last-Modified response header, and by 2:50pm, the change had been propagated to all of the agents. The HTTP 500 errors, together with the change of the Last-Modified field, indicate that the Pokémon GO team was likely working on server-side issues during that time.
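Tracking a header like Last-Modified across test rounds is easy to automate. The sketch below assumes the header arrives in the standard HTTP date format and that the year is 2016; the two values are the ones quoted in this post, but the sequence of samples is invented for illustration, and the function name is our own.

```python
from email.utils import parsedate_to_datetime

def detect_last_modified_changes(samples):
    """Find each point where an agent's Last-Modified value changes.

    samples: iterable of (agent, header_value) pairs in chronological order.
    Returns a list of (agent, previous_datetime, current_datetime) tuples.
    """
    last_seen = {}
    changes = []
    for agent, header_value in samples:
        current = parsedate_to_datetime(header_value)
        previous = last_seen.get(agent)
        if previous is not None and current != previous:
            changes.append((agent, previous, current))
        last_seen[agent] = current
    return changes

# Header values quoted in this post; ASSUMPTION: year 2016 and full HTTP date format.
BEFORE = "Tue, 19 Jul 2016 19:08:08 GMT"
AFTER = "Wed, 20 Jul 2016 16:52:20 GMT"

# ASSUMPTION: invented ordering of observations, for illustration only.
SAMPLES = [
    ("New Delhi", BEFORE),
    ("Sydney", BEFORE),
    ("New Delhi", AFTER),
    ("Sydney", BEFORE),
    ("Sydney", AFTER),
]

changes = detect_last_modified_changes(SAMPLES)
```

Comparing parsed datetimes rather than raw strings avoids false positives if a server changes the formatting of the header without changing the timestamp itself.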
Despite the changes, the agents continued to see errors. At 4:10pm Pacific, the majority of Cloud Agents began reporting 502 Bad Gateway errors, which indicate that a server acting as a gateway or proxy received an invalid response from an upstream server, possibly within the master/slave architecture.
At this time, the team was likely working on the servers to fix issues caused by the Pokémon GO version 1.0.3 update. After the outage was resolved around 5:35pm, the Last-Modified response header again changed to Wed, Jul 20, 16:52:20 GMT, likely indicating a fix for the HTTP errors.
Though many Pokémon GO trainers have expressed frustration over the game’s frequent outages on social networking websites and public forums, the game is still attracting millions of players around the world. Pokémon GO continues to expand globally: on July 22, the game was officially launched in Japan.
These two very different outages experienced by Pokémon GO bring to light some important lessons. First, secure and strengthen your network by adding capacity and implementing effective server load balancing, and consider technologies like anycast to make your network more resilient. DDoS attacks have become increasingly adept at crippling the networks of large organizations. Second, test changes in a replica of your production environment before releasing the final update. Finally, monitor your network and applications like a hawk to detect unexpected or irregular behavior in real time.
To start monitoring the services and applications most important to you, including Pokémon GO, sign up for a free trial of ThousandEyes today.