
AWS Outage Analysis: June 13, 2023

By Kemal Sanjta | 6 min read

Summary

On June 13, 2023, Amazon Web Services (AWS) experienced an incident that impacted a number of services in the US-EAST-1 region. Read the full outage analysis and key takeaways.


ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis of AWS’ US-EAST-1 service disruption on June 13, 2023, is based on our extensive monitoring, as well as ThousandEyes’ global outage detection product, Internet Insights.


Outage Analysis

On June 13, 2023, Amazon Web Services (AWS) experienced an incident that impacted a number of services in the US-EAST-1 region. The incident, which lasted more than 2 hours, was first detected around 18:50 UTC, when ThousandEyes observed an increase in latency, server timeouts, and HTTP server errors impacting the availability of applications hosted within AWS. The issue was mostly resolved by 20:40 UTC, with availability returning to normal levels for a majority of impacted AWS services, as well as subsequently affected applications. 

You can explore the outage within the ThousandEyes platform here (no login required).
Screenshot of ThousandEyes showing global locations unable to access an AWS-hosted application
Figure 1. Global locations failing to access an application hosted within AWS

During the incident, ThousandEyes did not observe any significant issues, such as high latency or packet loss, for network paths to AWS’ servers, as Figure 2 shows.

Screenshot of ThousandEyes showing network paths from global locations
Figure 2. Network paths from global locations show no packet loss or latency issues

However, the incident appeared to manifest as elevated response times, timeouts, and HTTP 5XX server errors for users attempting to access impacted applications (see Figure 3).

Screenshot of ThousandEyes showing HTTP server errors
Figure 3. HTTP server errors indicate internal application issue

The HTTP 5XX server errors, as well as the receive timeouts ThousandEyes observed, point to an application issue that was likely related to a backend process. The impacted applications appeared to experience issues regardless of where their frontend web servers were located. However, the simultaneous failure conditions ThousandEyes detected in the US-EAST-1 region suggest a common point of failure for applications leveraging AWS services in that region.
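The distinction drawn here, clean network paths combined with 5XX errors and receive timeouts pointing to a backend problem, can be expressed as a simple triage rule. The following is a minimal, illustrative sketch of that reasoning (the fields, thresholds, and classifier are hypothetical, not ThousandEyes' actual model):

```python
# Illustrative sketch: classifying a synthetic probe result to separate
# network-layer failures from application-layer failures. Field names and
# thresholds are assumptions for this example.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    connect_ok: bool            # did the TCP connection establish?
    status_code: Optional[int]  # HTTP status, or None on a receive timeout
    latency_ms: float


def classify(probe: ProbeResult, latency_slo_ms: float = 1000.0) -> str:
    """Return a coarse failure domain for a single probe."""
    if not probe.connect_ok:
        return "network"       # connection never established: path problem
    if probe.status_code is None:
        return "application"   # connected, but the backend never answered
    if probe.status_code >= 500:
        return "application"   # server accepted the request and then failed
    if probe.latency_ms > latency_slo_ms:
        return "degraded"
    return "healthy"


# During this incident, probes resembled the first two cases below:
print(classify(ProbeResult(True, 502, 120.0)))    # application
print(classify(ProbeResult(True, None, 5000.0)))  # application
print(classify(ProbeResult(True, 200, 85.0)))     # healthy
```

Because the connection succeeds in all three cases, a rule like this attributes the failure to the application backend rather than the network, which is the same inference drawn above from the clean network paths in Figure 2.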

Screenshot of ThousandEyes showing HTTP 502 errors
Figure 4. HTTP 502 errors returned when trying to reach US-EAST-1 management console  

Similar to the network conditions at the AWS edge servers, there was no apparent network degradation within the US-EAST-1 region at the time of the incident (see Figure 5).

Screenshot of ThousandEyes showing application availability
Figure 5. Application availability unimpacted by network conditions

Approximately 20 minutes after the start of the incident, at 19:08 UTC, AWS reported that they were investigating a service issue in the region. At 19:26 UTC, AWS identified the source of the issue as a capacity management subsystem located in US-EAST-1 that was impacting the availability of 104 of its services, including Lambda, API Gateway, AWS Management Console, Global Accelerator, and others. These affected services were experiencing elevated error rates and increased latencies. Subsequently, applications leveraging these services, regardless of where they were hosted or where they were serving users, would have experienced similar impacts in their own service availability. The issue was eventually resolved around 20:40 UTC, with availability returning to pre-incident levels, as Figure 6 shows.

Screenshot of ThousandEyes showing AWS availability restored
Figure 6. Availability restored to normal levels

Approximately 20 minutes later, at 21:00 UTC, another service disruption impacted some applications hosted in AWS for several minutes; however, this disruption appeared to be unrelated to the earlier two-hour incident.


Lessons and Takeaways

This incident illustrates the complex web of interdependencies that applications and services rely on today. Many of these dependencies may be indirect, or "hidden," from an organization because they are dependencies of the services it directly consumes. In particular, many services offered by cloud providers such as AWS have fundamental architectural dependencies on one another. Organizations leveraging cloud services, such as those offered by AWS, should be aware of the relationships in their digital ecosystem, whether those relationships involve services or networks.

Visibility is key to understanding dependencies and potential points of failure. Not every potential failure point in your service architecture is avoidable; however, being aware of your vulnerabilities and creating mitigation strategies in advance can enable you to minimize impact. The ability to quickly detect when these mitigation mechanisms are required is also critical to maintaining high availability, as it is for accurate attribution of issues.
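One common mitigation mechanism of the kind described above is a health-checked failover to a secondary region. A minimal sketch follows; the endpoint names and the health-check callback are hypothetical placeholders, and a production system would add timeouts, hysteresis, and alerting:

```python
# Illustrative sketch: pick the first healthy endpoint from an ordered
# list of regional candidates. Endpoint names are made up for this example.
from typing import Callable, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the first endpoint whose health check passes."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    # Every check failed: return the last candidate so callers still get a
    # target (and can surface the outage) rather than crashing outright.
    return endpoints[-1]


# Simulated health state for this incident: us-east-1 failing, us-west-2 fine.
health = {
    "api.us-east-1.example.com": False,
    "api.us-west-2.example.com": True,
}

chosen = pick_endpoint(list(health), lambda ep: health[ep])
print(chosen)  # api.us-west-2.example.com
```

The value of such a mechanism depends on detection: without independent monitoring that quickly attributes failures to a specific region or service, the failover may trigger too late, or not at all.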


To learn more about how you can get granular insight into your service dependencies and continuously monitor your availability, sign up for a ThousandEyes free trial.
