Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
Steam Outage on May 7th
Starting at 11:45 Pacific on May 7th and lasting for 2 hours, the popular online gaming service Steam suffered a widespread outage. Steam regularly has more than 8 million concurrent users, so it’s one of the bigger web services out there (Figure 1).
Users around the world reported being unable to log in or use the Steam store. Let’s break down what happened.
A (Nearly) Global Outage
We were monitoring the Steam API at the time of the outage. You can follow the interactive data here. The first thing we noticed was a dramatic drop in API availability, as shown in Figure 2.
The Steam API was unavailable from more than half of the 15 locations we tested over the 2-hour outage (Figure 3).
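If you want to approximate this kind of availability check yourself, the sketch below probes an HTTP endpoint on a fixed interval and records whether it answered with a 200 and how long it took. The public GetServerInfo endpoint and the one-minute cadence are illustrative assumptions on our part; our platform runs such tests from many agents around the world at once, which is what produces the per-location view in Figures 2 and 3.

```python
# Minimal availability probe for an HTTP API, in the spirit of Figure 2.
# The endpoint and one-minute cadence are illustrative assumptions.
import time
import requests

URL = "https://api.steampowered.com/ISteamWebAPIUtil/GetServerInfo/v1/"
INTERVAL_SECONDS = 60

def probe(url: str, timeout: float = 10.0) -> dict:
    """One availability check: did we get an HTTP 200, and how long did it take?"""
    started = time.time()
    try:
        response = requests.get(url, timeout=timeout)
        ok = response.status_code == 200
    except requests.RequestException:
        ok = False
    return {
        "timestamp": started,
        "available": ok,
        "latency_ms": round((time.time() - started) * 1000, 1),
    }

if __name__ == "__main__":
    while True:
        print(probe(URL))
        time.sleep(INTERVAL_SECONDS)
```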
So what went wrong? Let’s dig into the diagnostics. The first clue is in the HTTP errors we recorded: they were all TCP connection errors, which typically indicate a network-level problem such as congestion or loss of connectivity (Figure 4).
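The distinction matters: an HTTP 500 means the server answered but the application failed, while a TCP connection error means the handshake never completed, so packets aren’t reaching the service at all. As a rough sketch (with a placeholder hostname, not Steam’s actual ingress), you can separate the two cases by attempting the TCP handshake on its own:

```python
# Sketch: separate network-layer failures (TCP handshake fails) from
# application-layer failures (handshake succeeds, HTTP returns an error).
# The hostname is a placeholder, not Steam's actual API ingress.
import socket

def classify(host: str, port: int = 443, timeout: float = 5.0) -> str:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # Handshake completed, so packets are reaching the data center;
            # any remaining errors live above the network layer.
            return "tcp-ok: look at the HTTP/application layer"
    except socket.timeout:
        return "tcp-timeout: packets likely being dropped in transit"
    except OSError as exc:
        return f"tcp-error ({exc}): connection refused or network unreachable"

print(classify("api.example.com"))
```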
To confirm this, we can take a look at packet loss and latency. Packet loss spikes at the same time that availability dips, from 11:45am to 1:45pm Pacific (Figure 5).
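Our agents measure loss and latency with dedicated network probes from each location; as a rough single-machine stand-in, the sketch below times repeated TCP handshakes to a target and treats failed attempts as loss. The hostname, port and probe count are placeholders, and one vantage point will not reproduce the per-location pattern in Figure 5.

```python
# Rough stand-in for the loss/latency measurements in Figure 5: time repeated
# TCP handshakes to a target and treat failed attempts as loss.
# The hostname, port and probe count are placeholders.
import socket
import time

def measure(host: str, port: int = 443, count: int = 20, timeout: float = 2.0):
    rtts, lost = [], 0
    for _ in range(count):
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                rtts.append((time.time() - start) * 1000)  # handshake RTT, ms
        except OSError:
            lost += 1
        time.sleep(0.5)
    loss_pct = 100.0 * lost / count
    avg_rtt = sum(rtts) / len(rtts) if rtts else float("nan")
    return loss_pct, avg_rtt

loss, rtt = measure("store.example.com")
print(f"loss: {loss:.0f}%  avg RTT: {rtt:.1f} ms")
```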
Troubles in the Data Center
So we have a situation with high levels of packet loss from many locations around the world. Let’s take a look at a Path Visualization to see where the network instability is coming from.
While Steam content delivery is handled from many distributed locations, the authentication and storefront services for the platform are served out of a Seattle-area CenturyLink data center. We can see this data center in the visualization in light green on the right-hand side (Figure 6).
Steam’s data center is typically served by five primary ISPs. During the outage, traffic paths from Comcast and Qwest are successfully reaching the Steam data center, but paths from other upstream ISPs—Level 3, Telia and Abovenet/Zayo—are not (Figure 7).
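Path Visualization is built from per-hop measurements taken by many agents; a single-vantage-point approximation is to run a traceroute and map each hop’s IP address to its origin AS, which tells you which upstream ISP your own path into the data center transits. The sketch below assumes a Unix traceroute binary and RIPEstat’s public network-info lookup, and the target hostname is a placeholder, not Steam’s actual ingress.

```python
# Sketch: run a traceroute from this vantage point and map each hop's IP to
# its origin AS, to see which upstream ISP the path transits on the way in.
# Assumes a Unix 'traceroute' binary and RIPEstat's public network-info call;
# the target hostname is a placeholder.
import re
import subprocess
import requests

def hop_ips(target: str):
    out = subprocess.run(["traceroute", "-n", target],
                         capture_output=True, text=True, timeout=120).stdout
    # First IPv4 address on each hop line; unanswered hops ('*') are skipped.
    return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, flags=re.M)

def origin_asns(ip: str):
    resp = requests.get("https://stat.ripe.net/data/network-info/data.json",
                        params={"resource": ip}, timeout=10)
    return resp.json().get("data", {}).get("asns", [])

if __name__ == "__main__":
    for ip in hop_ips("store.example.com"):
        print(ip, origin_asns(ip))
```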
Only Some Roads Lead to Seattle
When traffic from some networks terminates completely while traffic from others is delivered properly, it’s usually either a physical network failure (a broken link or router in an IXP) or a routing issue. So let’s look into the routing plane. At the BGP layer we can see massive routing instability (Figure 8).
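Instability like this also shows up in public BGP data as a burst of announcements and withdrawals. As a rough cross-check, the sketch below counts the updates RIPE RIS collectors logged for the origin network during the outage window. It assumes RIPEstat’s public bgp-updates data call accepts an origin AS as the resource (if it only accepts prefixes, substitute one of the prefixes Valve originates), and the timestamps are our assumed UTC conversion of 11:45am to 1:45pm Pacific.

```python
# Sketch: count BGP announcements ("A") and withdrawals ("W") that RIPE RIS
# collectors recorded for the origin network during the outage window.
# Assumes RIPEstat's public bgp-updates data call and its field names;
# timestamps are an assumed UTC conversion of the outage window.
import requests

resp = requests.get(
    "https://stat.ripe.net/data/bgp-updates/data.json",
    params={
        "resource": "AS32590",              # Valve's network, per Figure 9
        "starttime": "2015-05-07T18:45:00",
        "endtime": "2015-05-07T20:45:00",
    },
    timeout=60,
)
updates = resp.json().get("data", {}).get("updates", [])
announcements = sum(1 for u in updates if u.get("type") == "A")
withdrawals = sum(1 for u in updates if u.get("type") == "W")
print(f"{len(updates)} updates: {announcements} announcements, {withdrawals} withdrawals")
```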
Looking at the AS paths themselves, we see that over time Qwest is preferred, while other routes to Steam (Valve network AS32590) are no longer advertised (Figure 9).
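You can get a coarse public-data view of the same thing by asking RIPE RIS which paths it currently sees toward Valve’s prefixes and which neighbor AS sits next to the origin on each path. The sketch below assumes RIPEstat’s announced-prefixes and looking-glass data calls and makes a best-effort guess at their response fields; it is an approximation from public route collectors, not the data behind Figure 9.

```python
# Sketch: for prefixes originated by AS32590, count which neighbor AS appears
# immediately before the origin in the AS paths seen by RIPE RIS collectors.
# Assumes RIPEstat's public announced-prefixes and looking-glass data calls;
# response field names are best-effort guesses at their schemas.
from collections import Counter
import requests

RIPESTAT = "https://stat.ripe.net/data"

def announced_prefixes(asn: str):
    resp = requests.get(f"{RIPESTAT}/announced-prefixes/data.json",
                        params={"resource": asn}, timeout=30)
    return [p["prefix"] for p in resp.json().get("data", {}).get("prefixes", [])]

def upstream_counts(prefix: str) -> Counter:
    resp = requests.get(f"{RIPESTAT}/looking-glass/data.json",
                        params={"resource": prefix}, timeout=30)
    counts = Counter()
    for rrc in resp.json().get("data", {}).get("rrcs", []):
        for peer in rrc.get("peers", []):
            hops = str(peer.get("as_path", "")).split()
            if len(hops) >= 2:
                counts[hops[-2]] += 1   # AS adjacent to the origin = upstream ISP
    return counts

if __name__ == "__main__":
    for prefix in announced_prefixes("AS32590")[:5]:   # first few prefixes only
        print(prefix, upstream_counts(prefix).most_common(3))
```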
Based on the evidence, our best guess is that a routing configuration change by Steam was the cause of the outage. We will update this if Steam releases a post-mortem.
Monitoring Data Center Ingress
Keeping the network flowing to the data center is critical. Without network connectivity, your data center is, for most applications, as good as useless. Understanding the ingress routes to your data center is just as important as checking on the physical fibers that rise up through the data center vaults. You can easily monitor your own data center ingress, including the Path Visualization and Route Visualization shown above. Get started with a free trial of ThousandEyes today.