Top Internet and Cloud Outages of 2017

Now that we have rung in the new year and settled into our daily routines, let’s take a few minutes to reflect on the state of the Internet and the key outages that shook the world in 2017. Over the past few years, as we have analyzed many such outage events, we’ve noticed an undeniable trend that governs today’s cloud era: the Internet is the backbone of human communication. We start and end our day with it. Be it calling a ride to work or ordering dinner on the way back, the Internet has a part to play in our quality of life. That is why, when it breaks, it causes havoc.

At ThousandEyes, we dedicate a lot of time to understanding and troubleshooting these Internet outages. Looking back on the 2017 outages, we noticed four clear patterns:

Enterprises are moving heavily to the public cloud and adopting a hybrid WAN architecture. However, they are seldom aware of, and are often blindsided by, the dependencies that arise when relying on IaaS providers like AWS or Azure.
BGP route leaks, malicious or unintentional, can cause worldwide disruptions on the Internet. The number of notable incidents triggered by BGP mishaps seemed significantly higher in 2017 than in 2016.
DNS remains a critical enabler of digital business. Be it the 2016 Dyn DDoS attack or the 2017 Marketo domain expiry, when DNS goes unattended it can have a large impact on revenue and reputation.
Popular services fail to anticipate demand and fall short on network and capacity planning.

Figure 1: Takeaways from the most significant outages of 2017.

AWS S3 Breaks the Cloud (February 28th)

On Tuesday, February 28th, someone fat-fingered a configuration change and brought AWS’ S3 cloud object storage service to the ground. In addition to S3, a host of other AWS services that rely on S3, like Relational Database Service (RDS), Elastic Load Balancers, and the Redshift data warehouse, also went dark. So while use of the S3 service often occurs on the back end and is not readily apparent to end users, the outage revealed that many of AWS’ other services have dependencies on S3. This cloud outage exposed a critical lack of redundancy in many services’ cloud storage solutions. Enterprises that rely on AWS to host critical services need to keep in mind their dependencies on both internal and external services.

Figure 2: Starting at 9:40am PST, the availability of the S3 services immediately dropped from normal levels down to 0%.
At the same time, packet loss also immediately increased to 100%.

Takeaway: Monitoring your cloud service providers, like AWS and Azure, in depth can help you understand the dependencies that exist within those providers and reduce the time needed to troubleshoot and resolve issues with their roots in the cloud. Pay attention to end-user experience and connectivity into these cloud infrastructures. Focus on understanding what dependencies exist when you use multiple services within your cloud vendor. For example, if you are an AWS shop, understand how your EC2, S3 and Redshift instances are interconnected and what interdependencies exist across AWS regions and Availability Zones (AZs). If you want to learn more about monitoring your services in AWS, watch our webinar.
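One way to make such dependencies explicit is to write them down as a graph and ask which services are affected when one node fails. The sketch below is only an illustration: the S3 edges mirror the outage described above, while the "web-app" entry and the function name impacted_by are hypothetical and not part of any AWS or ThousandEyes API.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it relies on.
# The S3 edges mirror the outage described above; "web-app" is invented.
DEPENDS_ON = {
    "rds": ["s3"],
    "elb": ["s3"],
    "redshift": ["s3"],
    "web-app": ["elb", "rds"],
    "s3": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively relies on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)
```

Calling impacted_by("s3", DEPENDS_ON) on this toy map returns the three S3-backed services plus the hypothetical web-app that sits on top of two of them, which is the kind of hidden blast radius the outage exposed.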

Russian Route Leaks (April 26th, December 12th)

In 2017, the hijacking and leaking of BGP routes by Russian networks and autonomous systems generated news and caused a disturbance in the Internet force. Most recently, on the morning of December 12th, a relatively obscure and previously unused autonomous system (AS39523), belonging to a Russian Internet provider DV-LINK-AS, hijacked the routes of organizations such as Google, Apple, Facebook, and Microsoft.

However, this wasn’t the first time that Russia made waves across the Internet. On Wednesday, April 26th, from 22:36 to 22:43 UTC, Rostelecom, a Russian (partially state-owned) telecommunications company, impacted traffic destined for e-commerce and payment processing services from financial services firms, as well as web security and Internet security offerings. Rostelecom originated (advertised that it was the proper destination for) 137 prefixes, or segments of the IP address space, that belonged to companies such as Symantec, Visa, Mastercard, Fortis and EMC. This hijack specifically targeted the one or two prefixes that predominantly serve e-commerce and payment processing services. For example, not all of Mastercard’s prefixes and traffic were hijacked, just those tied to their payment processing for services such as SecureCode.

Figure 3: Internet service providers such as Telstra, Level 3, Tinet and Hurricane Electric accepted the hijacked routes from Rostelecom.

The data, some of which can be seen in our visualization (Figure 3), shows which Internet service providers accepted the hijacked routes from Rostelecom. Some, such as Telstra, Level 3, Tinet and Hurricane Electric, accepted the routes, while others, such as Qwest, NTT and AT&T, did not. A multitude of factors determine whether a network accepts, prefers and then propagates a route to its own peers. Ultimately, it’s a matter of trust and of preferences for shorter routes and/or preferred peering and business partners.
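A hijack of this kind shows up as a prefix being originated by an AS that does not own it. The following is a minimal sketch of that check, assuming you already collect BGP announcements as (prefix, AS path) pairs from some route collector. The table EXPECTED_ORIGINS and the function find_origin_hijacks are hypothetical names, and the prefixes and AS numbers are documentation-reserved values rather than real assignments.

```python
import ipaddress

# Hypothetical table of prefixes we own and the AS that should originate them.
# The prefixes are documentation ranges and the AS numbers are reserved for
# documentation use, so none of them are real assignments.
EXPECTED_ORIGINS = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("198.51.100.0/24"): 64497,
}


def find_origin_hijacks(announcements, expected=EXPECTED_ORIGINS):
    """Flag announcements whose origin AS differs from the expected one.

    `announcements` is an iterable of (prefix, as_path) pairs, where as_path
    is a list of AS numbers that ends with the origin AS.
    """
    suspicious = []
    for prefix_text, as_path in announcements:
        prefix = ipaddress.ip_network(prefix_text)
        origin = as_path[-1]
        for owned, expected_origin in expected.items():
            # Only compare prefixes of the same IP version.
            if prefix.version != owned.version:
                continue
            # An announcement for our prefix, or for a more-specific slice of
            # it, should still originate from our AS.
            if prefix.subnet_of(owned) and origin != expected_origin:
                suspicious.append((prefix_text, origin, expected_origin))
    return suspicious
```

Each result is a tuple of the announced prefix, the origin AS that was seen, and the origin AS that was expected. The check only says that an origin is unexpected; deciding whether that is malicious, a misconfiguration, or a legitimate change you forgot to record still needs a human.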

Level 3 Leaks Comcast Prefixes (November 6th)

On Monday morning, November 6th, 2017, from 9:45am to 11:25am Pacific, Comcast suffered a nationwide outage, sending several million users of the popular Internet service provider into a frenzy. Our analysis revealed that at 9:30am Pacific, Level 3 leaked more than a thousand specific prefixes of Comcast subsidiary networks and their customers. Level 3 declared to the Internet that it was the best path (AS path) to these prefixes, forcing traffic for all of these networks to transit through Level 3 instead of the Comcast backbone. At 11:25am Pacific, Level 3 withdrew these leaked routes, re-establishing the peace and quiet of the Internet.

Takeaway: Whether they’re unintentional route leaks or malicious hijacks, BGP routing issues can cause widespread outages, damage the user experience of applications that increasingly rely on the Internet, and harm your digital brand reputation. Route leaks are yet another reminder of the vulnerability and fragility of the Internet. Apart from performance monitoring, enterprises should adopt a monitoring strategy that includes BGP monitoring and the detection of route leaks affecting their business-critical applications.
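A leak like the one above keeps the correct origin AS but changes how traffic gets there, so an origin check alone would miss it. One simple heuristic, sketched below, is to record which neighbors are expected to sit directly upstream of your origin AS and flag any AS path that reaches you through a different one. The policy table ALLOWED_UPSTREAMS, the function find_unexpected_upstreams, and all AS numbers are hypothetical, and this is not how any particular monitoring product detects leaks.

```python
# Hypothetical policy: for each origin AS we care about, the set of neighbor
# ASes that are expected to appear directly upstream of it in an AS path.
# All AS numbers are reserved for documentation use.
ALLOWED_UPSTREAMS = {64496: {64500, 64501}}


def find_unexpected_upstreams(as_paths, allowed=ALLOWED_UPSTREAMS):
    """Flag AS paths that reach a monitored origin via an unexpected neighbor.

    Each path is a list of AS numbers that ends with the origin AS.
    """
    alerts = []
    for path in as_paths:
        # Collapse AS-path prepending (the same AS repeated back to back).
        deduped = [asn for i, asn in enumerate(path) if i == 0 or asn != path[i - 1]]
        if len(deduped) < 2:
            continue  # The origin announced directly; there is no upstream.
        origin, upstream = deduped[-1], deduped[-2]
        if origin in allowed and upstream not in allowed[origin]:
            alerts.append((origin, upstream, path))
    return alerts
```

An alert here means only that traffic is transiting a network you did not expect. A new peering agreement would trigger it just as a leak would, so the allow-list has to be maintained alongside your actual peering relationships.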

Neglected DNS Bites Marketo (July 25th)

On Tuesday, July 25th, 2017 at 4:25am Pacific, all services related to Marketo’s domain “marketo.com” began experiencing an outage. Marketo is marketing automation software used by companies globally. The impact of the Marketo outage was widespread: users were unable to access the platform or even email Marketo employees, as the outage impacted their mail servers too.

We began noticing the outage when our tests to https://app.marketo.com alerted us to poor availability. Application availability dipped to around 60-70%, while packet loss spiked to around 20%. Digging deeper, we noticed that some traffic was reaching Marketo’s CDN provider, Akamai, while packet loss was occurring for other traffic that was, strangely, destined for IP address 208.91.197.132, which belongs to Confluence Networks (Figure 4) according to WHOIS records.

Figure 4: Traces are terminating on their way to the same IP address, 208.91.197.132, which belongs to Confluence Networks.
The other test traces go to Akamai, Marketo’s CDN provider.

Our analysis confirmed that Marketo’s failure to renew its domain name had triggered the outage. When the domain name expired, traffic to Marketo was redirected to an IP address belonging to another provider, Confluence Networks, effectively blackholing that traffic and resulting in application unavailability.
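The telltale symptom here was a hostname suddenly resolving to an address outside the networks where the service normally lives. A minimal sketch of that check follows, assuming you already have the list of addresses a lookup returned and an allow-list of networks you expect. EXPECTED_NETWORKS and unexpected_addresses are hypothetical names, and the allow-listed range is a documentation prefix standing in for real CDN address space.

```python
import ipaddress

# Hypothetical allow-list of networks where our hostnames should resolve.
# A documentation range stands in for the CDN's real address space.
EXPECTED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def unexpected_addresses(resolved, expected_networks=EXPECTED_NETWORKS):
    """Return the resolved addresses that fall outside every expected network."""
    outside = []
    for text in resolved:
        address = ipaddress.ip_address(text)
        # Membership tests across IP versions simply evaluate to False.
        if not any(address in network for network in expected_networks):
            outside.append(text)
    return outside
```

Fed a list such as ["203.0.113.10", "208.91.197.132"], the function returns only the second address, the one reported in the traces above. In practice the input would come from periodic lookups run from several vantage points, since cached records can make an expired domain look healthy from some locations and broken from others.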

Takeaway: DNS remains a critical enabler of digital business. Given the trend toward more distributed and dynamic applications, DNS performance is more critical than ever. If you are still making your New Year’s resolutions, don’t forget to add monitoring of all third-party services and applications, like your DNS providers or CDN vendors.
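Domain renewal is one of the easiest of these checks to automate, because the expiry date is known long in advance. The sketch below assumes you obtain that date from your registrar or a WHOIS lookup by some other means and only shows the countdown logic. The function names and the 60-day and 14-day thresholds are illustrative assumptions, not recommendations from the analysis above.

```python
from datetime import date


def days_until_expiry(expiry, today):
    """Number of whole days from `today` until the registration expires."""
    return (expiry - today).days


def expiry_status(expiry, today, warn_days=60, critical_days=14):
    """Classify a domain registration by how close it is to expiring.

    The thresholds are hypothetical defaults; tune them to your renewal process.
    """
    remaining = days_until_expiry(expiry, today)
    if remaining < 0:
        return "expired"
    if remaining <= critical_days:
        return "critical"
    if remaining <= warn_days:
        return "warning"
    return "ok"
```

Running this daily for every domain you depend on, including those owned by vendors, turns a silent expiry into an alert that arrives weeks before any traffic is affected.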

Traffic Spikes Crash Cambridge Website (October 24th)

On Oct. 23, Stephen Hawking’s Ph.D. thesis went live on the University of Cambridge website. Within 24 hours of the release, no one could access it. The release of the paper, timed to coincide with Open Access Week 2017 (a worldwide event aimed at promoting free and open access to scholarly research), significantly increased the inbound traffic to the website and ultimately crashed it. According to a Cambridge spokesperson, the website received nearly 60,000 download requests in less than 24 hours, causing a shutdown of the page, slower runtimes, and inaccessible content for users.

Takeaway: Benchmarking and capacity planning are critical for network operations. Best practices include testing your network prior to new software updates and large-scale events. Bolster your network architecture with CDN vendors and anycast architectures to maximize user experience. And monitor to make sure your vendors are performing as promised.
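A back-of-the-envelope calculation is often enough to show whether an event will exceed what a site can serve. The sketch below turns a request count over a time window into an average request rate and a rough peak bandwidth figure. Only the 60,000 requests in 24 hours comes from the incident above; the object size and the peak-to-average factor are hypothetical inputs you would replace with your own measurements.

```python
def average_rps(requests, window_hours):
    """Average requests per second over the window."""
    return requests / (window_hours * 3600)


def peak_bandwidth_mbps(requests, window_hours, object_mb, peak_factor):
    """Rough peak bandwidth in megabits per second.

    object_mb is the size of each download in megabytes, and peak_factor is
    how much busier the peak is than the average. Both are assumptions.
    """
    peak_rps = average_rps(requests, window_hours) * peak_factor
    return peak_rps * object_mb * 8  # 8 bits per byte
```

For 60,000 requests spread evenly over 24 hours, average_rps gives a little under 0.7 requests per second, which sounds trivial. The pressure comes from the other two inputs: a large file and a sharp spike right after the announcement multiply that figure many times over, which is why the estimate should be made with pessimistic values for both.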

A Brighter Future

To be perfectly honest, the Internet is never going to be outage free. We are going to see more of these outages in 2018. It is, however, possible to be better prepared when these massive outages occur. As your cloud-first initiatives drive the adoption of the Internet as the enterprise communication backbone, having complete visibility into all aspects of your business delivery is critical. Be aware of the inherent dependencies as you build a cloud-centric ecosystem of applications and services. Monitor every dependency, from your ISPs to your DNS providers to your CDN vendors. Baseline your network and establish the parameters of the new normal. By being aware of what’s going on, you can be prepared to react faster when disaster strikes. If you want to be smarter about navigating the inevitable Internet outages in 2018, sign up for a free trial.
