Now that we’ve rung in the new year, let’s look back at the state of Internet performance in 2015. We went through our archives to find the most impactful application and network outages and selected eight whose effects reverberated through many services, users and geographies.
The most impactful Internet outages of 2015 affected entire platforms or ecosystems, setting off a chain of consequences that few would have anticipated. The events ranged from a slew of Facebook outages that affected dependent services to a route leak that took down many sites hosted on AWS. These outages exposed the fact that seemingly unrelated businesses often share common vulnerabilities and services that prove to be critical to their site availability. Shared platforms, hosting services, DNS servers and even physical infrastructure all contribute to the interconnected, collective fragility of the Internet.
As you’ll see from this analysis, it’s imperative that IT operations teams understand how interrelated their systems truly are. Because sometimes, knocking over one domino will bring the rest down.
Facebook (January 26th and September 17th, 24th, 28th)
In the past year, Facebook has had several major outages (one in January and three in September) that impacted a wide variety of services that use Facebook’s identity and authentication platform. In the late evening on January 26th, Facebook went down for about an hour, taking Instagram (which Facebook owns), Tinder and HipChat with it. Most likely, Facebook’s engineering team shut down traffic after discovering a problem in their configuration systems. In the surprising run of September outages, which lasted from 10 to 100 minutes each, Facebook faced problems ranging from application errors on its own end to internal network errors, and these ultimately also impacted third-party applications and websites in industries from retail to gaming and entertainment.
One of the most significant repercussions of Facebook’s outages was that applications that rely on Facebook to authenticate, sign up and log in were no longer able to do so. When services rely on data provided by Facebook, they make a bet on the availability and longevity of Facebook’s services, perhaps without fully considering the possibility of extended outages totally outside their control.
Amazon Web Services (June 30th and September 20th)
AWS hosts a wide range of applications created by more than 1 million active customers. So when AWS has issues, entire swaths of online services grind to a halt.
On June 30th, AWS was the victim of a route leak which diverted traffic from major networks such as Cogent and Zayo to a Boston-based hosting provider for 42 minutes. The route leak impacted applications such as Netflix, Tinder, Amazon.com, Yelp, Jobvite, Experian and Zions Bank. Many mistakenly believed that the AWS outage was due to major fiber cuts that occurred in California earlier that day and focused their efforts in entirely the wrong place — a mistake that could have been avoided with good visibility into the situation.
On September 20th, a different issue occurred that again brought down a diverse range of services. This issue, caused by errors in AWS’ DynamoDB database service that impacted many other AWS services, made many applications unusable for hours. This second outage affected many of the same AWS-hosted services such as Netflix and Tinder, as well as IMDB, Reddit and Airbnb.
In both cases, seemingly unrelated applications all failed at once. Here at ThousandEyes, our internal chat system went down at the same time as some of our development environments and our blog.
Level 3 Communications (June 12th)
It’s not just hosting providers that can have widespread impacts; outages at large ISPs also affect a wide range of services. On June 12th, Level 3, a Tier 1 ISP, was affected by a route leak initiated by Telekom Malaysia. Over the course of two hours, services from Capital One, Google, Microsoft, LinkedIn, AOL, Reddit and Dow Jones were unreachable by many users.
This event again had the rumor mill buzzing about a series of large-scale DDoS attacks, but in the end the seemingly unrelated services turned out to be linked by their shared use of critical infrastructure and service providers. While it’s hard to avoid situations where multiple services rely on a major ISP like Level 3, knowing which critical services traverse certain networks can help speed up issue identification.
UltraDNS (October 15th)
There are other common links in the Internet delivery chain. One very important link is the DNS hosting provider that supplies authoritative nameservers for its customers. On October 15th, UltraDNS had a 2.5-hour outage caused by configuration errors. The outage impacted dozens of high-profile customers including Netflix, Expedia, Ameritrade, eTrade, Pornhub, Uber and Zions Bank.
Diversifying DNS hosting providers is something any company can do. You can typically specify four or more nameservers for a domain. For major domains, you should spread those nameservers across two different providers to reduce the risk that all of them become unreachable at once.
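As a rough illustration of that advice, here is a minimal Python sketch that checks whether a domain’s nameserver list spans more than one provider. The nameserver hostnames are hypothetical, and the sketch assumes that the last two labels of a hostname identify the provider, which is a simplification that will not hold for every naming scheme. In practice the list would come from the domain’s NS records.

```python
from collections import defaultdict


def provider_of(nameserver):
    """Approximate the provider as the last two labels of the hostname.

    Assumption for illustration only: real provider naming can be messier.
    """
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers):
    """Map each inferred provider to the nameservers it hosts."""
    groups = defaultdict(list)
    for ns in nameservers:
        groups[provider_of(ns)].append(ns)
    return dict(groups)


def has_provider_diversity(nameservers, minimum=2):
    """Return True when the nameservers span at least `minimum` providers."""
    return len(group_by_provider(nameservers)) >= minimum


# Hypothetical nameserver sets for a single domain.
single_provider = [
    "ns1.dns-a.example", "ns2.dns-a.example",
    "ns3.dns-a.example", "ns4.dns-a.example",
]
two_providers = [
    "ns1.dns-a.example", "ns2.dns-a.example",
    "ns1.dns-b.example", "ns2.dns-b.example",
]

print(has_provider_diversity(single_provider))  # False
print(has_provider_diversity(two_providers))    # True
```

Four nameservers at one provider still share that provider’s fate, which is why the check counts providers rather than hostnames.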
Apple App Store (March 11th)
Caused by an internal DNS error, Apple’s outage on March 11th affected the iTunes Store, iBooks Store, App Store, Mac App Store and iCloud. Many of iTunes’ 500 million users were unable to buy apps or other content, and in some cases were prevented from downloading updates and even opening already-purchased apps. The App Stores in the US, UK, Australia, Canada, many European countries, some African countries and some Asian countries were affected for up to 11 hours.
Apple is estimated to have lost around $2.2 million in revenue for every hour its various stores were down, with developers losing $1.1 million of that sum per hour. In total, Apple is estimated to have lost around $25 million in revenue. And the outage didn’t just affect Apple: when its stores went down, a huge number of people were affected, including the developers and companies who made games and apps like Clash of Clans and Candy Crush, the musicians and labels who produced the music for sale on iTunes, and the authors and publishers who published the books sold on iBooks. Apple’s stores have grown into broad ecosystems of businesses large and small, so when these stores experience outages, entire markets go dark.
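The total is roughly the hourly estimate multiplied by the longest reported duration. The short Python check below works through that arithmetic using only the figures quoted above. It assumes every store was down for the full 11 hours, so it should be read as an upper bound and not as an exact accounting.

```python
# Figures quoted above, in millions of US dollars per hour.
HOURLY_LOSS_MILLIONS = 2.2        # estimated total revenue lost per hour
DEVELOPER_SHARE_MILLIONS = 1.1    # portion of that lost by developers
OUTAGE_HOURS = 11                 # longest reported duration ("up to 11 hours")

# Round to one decimal place to avoid floating-point noise in the output.
total_loss = round(HOURLY_LOSS_MILLIONS * OUTAGE_HOURS, 1)
developer_loss = round(DEVELOPER_SHARE_MILLIONS * OUTAGE_HOURS, 1)

print(f"Upper-bound total loss: ${total_loss} million")
print(f"Of which developers:    ${developer_loss} million")
```

The result, about $24 million, is in line with the rounded figure of roughly $25 million, with about half of it falling on developers.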
Tata Communications (February 25th)
Sometimes services fail because they share a common physical infrastructure — a cable or router, for instance — that breaks or is faulty. On February 25th, Tata reported that a cable running between India and Singapore was cut, affecting nearly 350 gigabits per second of capacity. The outage lasted for four hours and affected many different services in India whose traffic traveled over that same cable, including Box, Docusign, Facebook, Oracle and Salesforce.
Unlike in North America or Europe, where there are typically many fiber links between markets, there are far fewer submarine cables to certain regions of the world. For companies serving Asian, South American or African markets, be aware of your transit providers and whether they have sufficient capacity to fail over to backup or redundant links. Otherwise, you may have a host of unhappy users when these submarine cables inevitably break.
NYSE, United, WSJ (July 8th)
Though we’ve encountered outages of a variety of disparate services with the same root cause, it’s important to remember that this isn’t always the case. Case in point: on the day the New York Stock Exchange, United Airlines and the Wall Street Journal all went down simultaneously, many people jumped to the conclusion that it was the result of a concerted hack or attack. But the three outages were actually entirely unrelated: United had a router issue, NYSE blamed a software update, and WSJ’s servers were most likely overloaded with traffic.
Over the past ten years, there has been tremendous innovation on the application stack. But think about what the web experience was like ten years ago, and then consider what interactive applications are like today: from Google Docs to Netflix and WhatsApp, all of these applications still rely on the same Internet architecture that existed a decade ago, an architecture that was not built for today’s connected world. As our list of outages demonstrates, people often don’t know where the common failure points are, or realize how interconnected our Internet really is, until things break. For example, if an ISP in Lithuania announces the address block belonging to Netflix, a customer in Dallas won’t be able to, say, watch a movie on Netflix hosted in a San Francisco data center. It’s critically important that IT Operations teams appreciate all the complexity and moving parts behind that seamless Netflix viewing experience, or that convenient Tinder login with Facebook.
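To see why a stray announcement from a distant network can pull traffic away from its real destination, consider the simplified Python sketch below. It models only one part of route selection, longest-prefix matching, and assumes the stray announcement covers a more specific prefix than the legitimate one. Real BGP route selection weighs many more attributes, and the prefixes and AS numbers here come from ranges reserved for documentation, not from any of the incidents above.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (network, origin) pair with the longest matching prefix.

    `routes` is a list of (prefix, origin) tuples. Returns None when no
    route covers the destination.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, origin))
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical routing table before the leak: one legitimate origin.
normal_routes = [("203.0.113.0/24", "AS64500")]

# After the leak: another network announces a more specific prefix.
leaked_routes = normal_routes + [("203.0.113.0/25", "AS64511")]

print(best_route(normal_routes, "203.0.113.10")[1])  # AS64500
print(best_route(leaked_routes, "203.0.113.10")[1])  # AS64511
```

Nothing about the destination changed between the two lookups. Only the set of announcements did, which is why these events are so hard to diagnose from inside the affected service.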
As the ecosystem of applications, services and critical infrastructure becomes more and more interconnected, outages can cascade to affect increasingly large portions of business users’ or consumers’ daily lives. IT teams should review the incidents from 2015, take steps to identify all possible breaking points in their own and other critical services, and ensure that useful data and insights are on hand so that they can act quickly during crises. Common points of failure could be major ISPs, a DNS provider, a hosting or IaaS vendor, or APIs that are used across a wide variety of applications. With that groundwork in place going into 2016, you’ll be set to reduce the impact of inevitable Internet outages on your organization.
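One simple way to start identifying common points of failure is to invert a service inventory so that each external dependency lists the services relying on it. The Python sketch below does this for a made-up inventory. All service and dependency names are hypothetical, and a real inventory would need to be assembled from your own architecture and monitoring data.

```python
from collections import defaultdict


def shared_dependencies(service_deps, threshold=2):
    """Return dependencies used by at least `threshold` services.

    `service_deps` maps a service name to the set of external
    dependencies it relies on.
    """
    users = defaultdict(set)
    for service, deps in service_deps.items():
        for dep in deps:
            users[dep].add(service)
    return {
        dep: sorted(services)
        for dep, services in users.items()
        if len(services) >= threshold
    }


# Hypothetical inventory of internal services and what they depend on.
inventory = {
    "login": {"social-auth-api", "dns-provider-a"},
    "chat": {"iaas-vendor-x", "dns-provider-a"},
    "blog": {"iaas-vendor-x", "dns-provider-a"},
    "billing": {"payments-api"},
}

for dep, services in sorted(shared_dependencies(inventory).items()):
    print(f"{dep}: {', '.join(services)}")
```

In this example the DNS provider and the IaaS vendor surface as shared dependencies, so an outage at either would take down several otherwise unrelated services at once.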