As we sip on warm holiday beverages and gradually wind down the year, it’s customary to reflect on the past and contemplate the future. Following tradition, we took a stroll down memory lane, analyzing the state of the Internet and the “cloud”. In today’s blog post we discuss the most impactful outages of 2016, identify common trends and highlight the key lessons.
As we analyzed the outages that occurred this year, we noticed four clear patterns emerge.
- DDoS attacks took center stage and clearly dominated the year. While the intensity and frequency of DDoS attacks have been increasing over time, the attacks that plagued 2016 exposed the vulnerability of the Internet and its dependence on critical infrastructure like DNS.
- Popular services weren’t ready for a crush of visitors. Network and capacity planning is critical to meet the needs of the business during elevated traffic. For example, the lack of a CDN front end can prove costly during peak load if it isn’t factored into the network architecture.
- Infrastructure redundancy is critical. Enterprises spend considerable time and money guarding against internal data center and link-level failures, yet external services and vendors are often overlooked.
- The Internet is fragile and needs to be handled with care. Cable cuts and routing misconfigurations can have global impact, resulting in service instability and blackholed traffic.
DDoS Attacks: Hitting Where It Hurts
While DDoS attacks came in all shapes and sizes this year, four had the highest impact, and three of those four targeted DNS infrastructure.
On May 16th, NS1, a cloud-based DNS provider, was the victim of a DDoS attack in Europe and North America. Enterprises that rely on NS1 for DNS services, such as Yelp and Alexa, were severely impacted. While this started as an attack on DNS, it gradually spread to NS1's online-facing assets and its website hosting provider.
The second attack, on June 25th, came in the form of 10 million packets per second targeting all 13 DNS root servers. It was a large-scale attack on the most critical part of the Internet infrastructure and resulted in roughly 3 hours of performance issues. Even though all 13 root servers were impacted, we noticed varying levels of impact and resilience, with a strong correlation between a root server’s anycast footprint and the impact of the attack. Root servers with more anycast locations saw the attack traffic diluted and were relatively more stable than root servers with fewer locations.
The mother of all DNS DDoS attacks was single-handedly responsible for bringing down SaaS companies, social networks, media, gaming, music and consumer products. On October 21st, a series of three large-scale attacks was launched against Dyn, a managed DNS provider. The attack impacted over 1,200 domains that our customers were monitoring and had global reach, with the heaviest effects in North America and Europe. We saw impacts on 17 of the 20 Dyn data centers around the world for both free and paid managed DNS services. Customers who relied solely on Dyn for DNS were severely impacted, but those who load-balanced their DNS name servers across multiple providers were able to fall back on a secondary vendor during the attack. For example, Amazon.com had multiple DNS providers, UltraDNS and Dyn, and as a result did not suffer the same unavailability as many of Dyn’s other customers.
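If you want to check whether a zone you depend on has this kind of provider diversity, a quick look at its NS records is enough. Below is a minimal sketch using the third-party dnspython library; the domain and the provider-grouping heuristic are placeholders for illustration, not a description of how Amazon or Dyn customers actually configure things.

```python
# Minimal sketch: check whether a zone's NS records span more than one
# DNS provider, so a single provider outage can't take the whole zone down.
# Requires the third-party dnspython package; the domain is a placeholder.
import dns.resolver

DOMAIN = "example.com"

answers = dns.resolver.resolve(DOMAIN, "NS")
nameservers = sorted(str(rr.target).rstrip(".") for rr in answers)

# Crude heuristic: treat the last two labels of each NS hostname as the provider
providers = {".".join(ns.split(".")[-2:]) for ns in nameservers}

print(f"{DOMAIN} delegates to {len(nameservers)} name servers "
      f"across {len(providers)} provider domain(s): {sorted(providers)}")
if len(providers) < 2:
    print("Warning: all name servers appear to belong to a single provider")
```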
The DDoS attack on the Krebs on Security website on September 13th was record-breaking in size, peaking at 555 Gbps. Both the Dyn and Krebs attacks were driven by the Mirai botnet of hacked consumer devices. While the Internet of Things is set to revolutionize the networking industry, security needs to be top of mind.
Targeting critical infrastructure like DNS is an efficient attack strategy. The Internet, for the most part, runs like a well-oiled machine; incidents like these are a reality check on network architecture and monitoring mechanisms. Consider monitoring not just your online-facing assets but also any critical service, like DNS, and set up alerts so that the first signs of network instability trigger the right mitigation strategy for your environment.
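As a rough illustration of what such monitoring can look like, the sketch below queries each authoritative name server for a zone directly and flags timeouts or slow answers. It assumes the third-party dnspython library; the domain and threshold are placeholders to adapt to your environment.

```python
# Bare-bones active DNS monitoring: query each authoritative name server
# directly and flag timeouts or slow answers. Uses the third-party
# dnspython package; the domain and threshold below are placeholders.
import time
import dns.exception
import dns.message
import dns.query
import dns.resolver

DOMAIN = "example.com"
SLOW_MS = 200  # alert threshold in milliseconds, tune for your environment

ns_names = [str(rr.target) for rr in dns.resolver.resolve(DOMAIN, "NS")]
for ns in ns_names:
    ns_ip = str(dns.resolver.resolve(ns, "A")[0])
    query = dns.message.make_query(DOMAIN, "A")
    start = time.monotonic()
    try:
        dns.query.udp(query, ns_ip, timeout=2.0)
        elapsed_ms = (time.monotonic() - start) * 1000
        status = "SLOW" if elapsed_ms > SLOW_MS else "ok"
        print(f"{ns} ({ns_ip}) answered in {elapsed_ms:.0f} ms [{status}]")
    except dns.exception.Timeout:
        print(f"ALERT: {ns} ({ns_ip}) did not answer within 2 s")
```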
Application Popularity: Overloaded Networks
When it comes to application usage and websites, there is no such thing as too many visitors, until the network underneath begins to collapse. 2016 witnessed some popular services unable to keep up with demand.
January 13th witnessed one of the largest lottery jackpots in U.S. history. Unfortunately, it also witnessed the crumbling of Powerball, the website that serves up jackpot estimates and winning numbers. Increased packet loss and extended page load times indicated that neither the network nor the application could handle the uptick in traffic. In an attempt to recover, Powerball introduced Verizon’s Edgecast CDN right around the time of the drawing, distributing traffic across three data centers (the Verizon Edgecast CDN, Microsoft’s data center and the Multi-State Lottery Association data center), but it was too late. The damage was already done, and the user experience on the website was substandard.
The summer of 2016 saw a gaming frenzy, thanks to Pokemon Go. On two separate occasions (July 16th and July 20th), Pokemon trainers were unable to catch and train their favorite characters. The first outage, characterized by elevated packet loss for 4 hours, was a combination of the network architecture and overloaded target servers unable to handle the uptick in traffic. The second, worldwide outage was caused by a software update that resulted in user login issues and incomplete game content.
November 8th was a defining moment in global politics. It was also the day the Canadian Immigration webpage was brought down by scores of frantic Americans. As U.S. states closed their presidential polls and results began trickling in, the immigration website started choking before finally giving up. We noticed 94% packet loss at one of the upstream ISPs, an indication that the network could not keep up with the spike in traffic.
Benchmarking and capacity planning are critical for network operations. Best practices include testing your network before new software updates and large-scale events. Bolster your network architecture with CDN vendors and anycast architectures to maximize user experience, and monitor to make sure your vendors are performing as promised.
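As a small illustration, the sketch below spot-checks whether a fronting CDN or origin keeps response times within an agreed budget. The URL and budget are placeholders, and a single vantage point is no substitute for distributed monitoring; it only shows the shape of such a check.

```python
# Simple spot-check of response time against a performance budget.
# The URL and budget are placeholders; a real check would run repeatedly
# from multiple locations rather than once from a single vantage point.
import time
import urllib.request

URL = "https://www.example.com/"
BUDGET_MS = 500

samples = []
for _ in range(5):
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        resp.read()  # include full transfer time, not just time to first byte
    samples.append((time.monotonic() - start) * 1000)

worst = max(samples)
print(f"worst of {len(samples)} fetches: {worst:.0f} ms")
if worst > BUDGET_MS:
    print("ALERT: response time exceeds budget; check CDN/origin capacity")
```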
Fragile Infrastructure: Cable Cuts and Routing Outages
The network is not free from the occasional cable cut or user-induced misconfiguration. Let’s see how a simple oversight can sometimes impact services even across geographical boundaries.
On April 22nd, AWS experienced a route leak when more-specific /21 prefixes were advertised by Innofield (AS 200759) as belonging to a private AS and propagated through Hurricane Electric. As a result, Amazon-destined traffic transiting Hurricane Electric was routed to the private AS rather than Amazon’s AS. While the impact of this route leak was minimal, it was rather tricky: the leaked prefixes were not identical to Amazon’s prefixes but more specific, and thus preferred over Amazon’s routes. This was no malicious act, but rather a misconfiguration of a route optimizer at Innofield.
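The reason a leaked more-specific prefix wins is simply longest-prefix matching: routers prefer the most specific route that covers a destination. The toy example below uses hypothetical RFC 1918 prefixes and AS numbers, not the actual Amazon or Innofield routes, to show why the leaked /21 was preferred.

```python
# Toy illustration of longest-prefix-match route selection. The prefixes
# and AS numbers are hypothetical, not the actual Amazon/Innofield routes.
import ipaddress

routes = [
    (ipaddress.ip_network("10.0.0.0/16"), "AS64500 (legitimate aggregate)"),
    (ipaddress.ip_network("10.0.8.0/21"), "AS64999 (leaked more-specific)"),
]

def best_route(dst, routes):
    """Return the matching route with the longest prefix, as BGP would prefer."""
    candidates = [(net, origin) for net, origin in routes if dst in net]
    return max(candidates, key=lambda r: r[0].prefixlen) if candidates else None

dst = ipaddress.ip_address("10.0.9.25")
print(best_route(dst, routes))
# -> the leaked /21 wins over the legitimate /16, so traffic is misdirected
```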
Level 3 experienced serious network issues across several locations in the U.S. and U.K. on May 3rd. The outage lasted about an hour and took down services including Cisco, Salesforce, SAP and Viacom. We were able to trace the issue to a possible misconfiguration or failure in one of the transcontinental links.
On May 17th, a series of network and BGP-level issues was correlated with a possible cable fault in the cross-continental SEA-ME-WE-4 line. While the fault appeared to be located around Western Europe, it had ripple effects across half the globe, affecting Tata Communications in India and the TISparkle network in Latin America. While monitoring your networks, look for indicators of cable faults: dropped BGP sessions or peering failures, and multiple impacted networks with elevated loss and jitter.
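As a rough sketch of what such an indicator check might look like, the snippet below measures loss and jitter toward a target using TCP connect times as a crude stand-in for ICMP probes. The host, port and thresholds are placeholders.

```python
# Crude loss/jitter probe using TCP connect times as a stand-in for ICMP.
# Host, port and thresholds are placeholders; tune them for your paths.
import socket
import statistics
import time

def probe(host, port=443, count=10, timeout=2.0):
    rtts, lost = [], 0
    for _ in range(count):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                rtts.append((time.monotonic() - start) * 1000)  # RTT in ms
        except OSError:
            lost += 1
        time.sleep(0.5)
    loss_pct = 100.0 * lost / count
    jitter = statistics.pstdev(rtts) if len(rtts) > 1 else 0.0
    return loss_pct, jitter

loss, jitter = probe("example.com")
print(f"loss={loss:.0f}% jitter={jitter:.1f} ms")
if loss > 20 or jitter > 50:
    print("possible path-level fault: correlate with BGP session state")
```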
On July 10th, JIRA, the SaaS-based project tracking tool, was offline for about an hour. From a BGP reachability perspective, all routes to the /24 prefix for JIRA were withdrawn from Level 3, causing routers to search for an alternate path. Unfortunately, due to a misconfiguration of the backup prefix, the backup path funnelled all the traffic to the wrong destination AS, and traffic terminated in NTT’s network instead of being routed to JIRA.
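One way to catch this class of problem is to periodically verify that your prefix is still visible and originated by the AS you expect. The sketch below uses the public RIPEstat prefix-overview endpoint; the URL format and response fields are assumptions to verify against the RIPEstat documentation, and the prefix and AS number are placeholders rather than JIRA’s actual values.

```python
# Sketch of a prefix visibility/origin check against the public RIPEstat
# "prefix-overview" endpoint. The endpoint URL and response fields are
# assumptions to verify against the RIPEstat docs; the prefix and expected
# origin AS below are placeholders, not JIRA's actual values.
import json
import urllib.request

PREFIX = "192.0.2.0/24"   # placeholder prefix
EXPECTED_ORIGIN = 65001   # placeholder origin AS

url = f"https://stat.ripe.net/data/prefix-overview/data.json?resource={PREFIX}"
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)["data"]

if not data.get("announced"):
    print(f"ALERT: {PREFIX} is not visible in the global routing table")
else:
    origins = {entry["asn"] for entry in data.get("asns", [])}
    if EXPECTED_ORIGIN not in origins:
        print(f"ALERT: {PREFIX} originated by {origins}, expected AS{EXPECTED_ORIGIN}")
```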
Looking Ahead
So, what have we learned? By their very nature, networks are bound to have outages and security threats. Smarter networks are not the ones built to be foolproof, but the ones that can quickly react to failures and inconsistencies. As the Internet becomes the glue that binds SaaS and service delivery, it is paramount to have visibility into its shortcomings, especially during a crisis. As you march into the new year, take stock of the past year’s events and prepare for the future. Bolster your network security, but at the same time monitor how your network performs under adverse conditions. Detect bottlenecks and common points of failure, and distribute dependencies across ISPs, DNS service providers and hosting providers. Join us for a live webinar and learn how to monitor critical infrastructure as we analyze the key outages of 2016.