On Friday, March 2nd, 2018, a power outage hit the AWS-East region (Ashburn), affecting hundreds of critical enterprise services, including Atlassian, Slack and Twilio. Major corporate websites and Amazon's own service offerings were impacted as well.
ThousandEyes operates a global set of software agents that perform Internet-aware network monitoring, and we were able to catch this outage in the act as it rippled quickly outward to other services. Even customers relying on Amazon's private connectivity service, AWS Direct Connect, were affected. Today's outage comes nearly a year after the massive AWS S3 outage of 2017 and is a hard-hitting reminder of the vulnerability of the cloud. Enterprises are rapidly adopting Cloud First strategies, moving workloads to IaaS providers like AWS. However, many organizations still do not fully comprehend the unpredictable dependencies that come with this shift.
Before we delve into the details of how it went down, here is a quick look at the impact and severity of the AWS outage. ThousandEyes monitors critical services across the Internet from multiple vantage points and algorithmically correlates the data to understand service impacts. Over 240 critical services relying on AWS infrastructure felt the impact of today's outage. AWS-East is one of the oldest AWS regions and by far the largest, with at least six Availability Zones (AZs). The region is located in Ashburn, VA, a major hub of connectivity for Internet and cloud providers. What started as a power outage affecting a small set of services quickly cascaded into a major issue, one that even impacted customers who had subscribed to Amazon's dedicated connectivity offering, AWS Direct Connect.
Wake Up Call
On March 2nd at 6:20am PST, we were alerted to service interruptions affecting AWS EC2 endpoints in the AWS-East (us-east-1) region. At first glance, it seemed like a harmless, intermittent issue: a small spike in packet loss within AWS. Follow along with an interactive sharelink.
However, a BGP AS path change coinciding with the packet loss made us suspicious. Given the signatures of the recent GitHub DDoS attack, we initially suspected these signals were a prelude to another DDoS attack, but we were quickly proved wrong. The BGP event was short-lived, with AS paths returning to steady state, indicating a route flap rather than an intentional path change. At 8:10am PST, we saw the same pattern repeat: a drop in service availability and an increase in packet loss within the AWS network. AWS later confirmed that the blip in connectivity was related to a power outage in its US-East-1 region. Based on the timing, we believe the power outage contributed to the BGP flaps and service disruption.
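To make the detection side concrete, here is a minimal sketch of the kind of simple reachability check that could surface a loss spike like the one described above. This is not how ThousandEyes agents work; it just times a batch of pings against a public EC2 API hostname (ec2.us-east-1.amazonaws.com, used purely for illustration) and flags loss above an arbitrary, assumed threshold.

```python
# Minimal sketch: measure packet loss to a regional AWS endpoint using the
# system ping utility (Linux/macOS). The target hostname and the 5% threshold
# are illustrative assumptions, not a ThousandEyes alert rule.
import re
import subprocess

TARGET = "ec2.us-east-1.amazonaws.com"
LOSS_THRESHOLD = 5.0  # percent; assumed alerting threshold

def measure_loss(target: str, count: int = 20) -> float:
    """Run ping and parse the reported packet-loss percentage."""
    out = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    loss = measure_loss(TARGET)
    status = "ALERT" if loss > LOSS_THRESHOLD else "OK"
    print(f"{status}: {loss:.1f}% packet loss to {TARGET}")
```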
Falling Like Dominoes
At nearly the same time (6:25am PST), we were alerted to another service disruption: engineers at ThousandEyes were unable to access Atlassian's Jira issue-tracking platform. We quickly turned our attention to ThousandEyes' internal tests monitoring Atlassian. Global service availability had dropped by 70%, as seen in Figure 3.
The AWS EC2 service interruptions and the Atlassian outage might have seemed like two unrelated, coincidental events, but we knew that was too good to be true. We started digging into Path Visualization and quickly connected the dots: Atlassian uses AWS to host its services, which explains the correlated behavior. As seen in Figure 4 below, Atlassian services hosted in the Ashburn AWS data center were experiencing 100% packet loss within the AWS network.
Atlassian was just one of many services impacted today. We noticed that services relying on multiple AWS regions saw relatively lower impact than services that relied solely on AWS-East. When depending on external service providers like AWS, enterprises should be deliberate about redundant architectures and how those architectures affect user experience.
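As an illustration of the kind of redundancy that held up better today, here is a minimal sketch of client-side failover across regional endpoints. The hostnames and the /health path are hypothetical placeholders, not a prescribed architecture; managed alternatives such as DNS-based health-checked failover achieve the same goal.

```python
# Minimal sketch of client-side regional failover, assuming an application
# exposes the same HTTPS health endpoint in more than one AWS region.
# The endpoint URLs below are hypothetical examples.
import requests

REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com/health",   # primary (Ashburn)
    "https://api.us-west-2.example.com/health",   # secondary (Oregon)
]

def first_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint that answers, falling through on failure."""
    for url in endpoints:
        try:
            if requests.get(url, timeout=timeout).ok:
                return url
        except requests.RequestException:
            continue  # region unreachable or slow; try the next one
    raise RuntimeError("no healthy region found")

if __name__ == "__main__":
    print("Routing traffic to:", first_healthy_endpoint(REGIONAL_ENDPOINTS))
```

The design choice that matters here is simply that the client (or a DNS layer in front of it) has somewhere else to go when one region is dark, which is what separated the short-lived blips from the full outages we observed.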
Takeaways for Going Cloud First
This episode is a powerful reminder that the cloud is a complex, interconnected system. Outages and natural disasters in one part of the cloud can quickly ripple into other areas. Cloud vendors offer several ways to connect directly into their infrastructure, but these do not make you immune to the external dependencies of the Internet. While availability zones offer some level of redundancy, regional outages like this one can quickly envelop entire clusters of data centers.
So what can we do to mitigate this? Treat geographic redundancy as a key part of your fault-tolerance strategy. While we saw a number of services that went out cold, we also saw several that experienced only a short-lived blip and recovered successfully. Make sure your workloads are not concentrated in a single geographic region that is exposed to the same shared risk. Be mindful of inter-region latencies as microservices make API calls between components. Also, monitor connectivity to your cloud infrastructure and services so you can correctly identify the scope and root cause of service outages; the key to successful recovery is understanding the actual source of the problem. If you rely on AWS for your cloud needs, we strongly urge you to watch our webinar on Monitoring Connectivity to AWS for best practices and tips.
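For the monitoring point above, here is a minimal sketch that compares TCP connect times to public EC2 API endpoints in several regions, so that a regional problem (say, us-east-1 only) stands out from a wider outage or a purely local network issue. The endpoint list, port and timeout are illustrative assumptions, and a single probe location is no substitute for monitoring from multiple vantage points.

```python
# Minimal sketch of cross-region connectivity checks: time a TCP connect to a
# public EC2 API endpoint in several regions and report latency or failure.
import socket
import time

REGION_ENDPOINTS = {
    "us-east-1": "ec2.us-east-1.amazonaws.com",
    "us-west-2": "ec2.us-west-2.amazonaws.com",
    "eu-west-1": "ec2.eu-west-1.amazonaws.com",
}

def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0):
    """Return TCP connect time in milliseconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    for region, host in REGION_ENDPOINTS.items():
        latency = tcp_connect_ms(host)
        status = f"{latency:.0f} ms" if latency is not None else "UNREACHABLE"
        print(f"{region:10s} {host:35s} {status}")
```

Run periodically, a comparison like this at least tells you whether the problem is confined to one region before you start a failover or file a support ticket.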
Talk to ThousandEyes for battle-tested best practice recommendations for monitoring the cloud. Of course, if you already know that you need this level of cloud visibility, request a demo or start a free trial.