ThousandEyes has published the third edition of the Cloud Performance Report, in which we examine and compare the performance data and connectivity architectures of the top three public cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. The report offers a measurement-based comparison and analysis of network metrics and mappings collected over a three-year period, providing an agnostic view into the performance and behaviors of the major public cloud providers.
Cloud consumption has continued to accelerate exponentially ever since the previous edition of this report was published in 2019. Cloud-based workloads support an enormous scope of digital services today and, when performance degradations occur, ripple effects can impact more services and service dependencies than ever before. Visibility is the essential ingredient to help IT Operations (ITOps) teams understand the when, why, and where of service degradations in a world of increasingly distributed, API-centric, and cloud-dependent applications. The data in the Cloud Performance Report is, therefore, intended as a blueprint to help enterprises get visibility specific to their own cloud deployments and dependencies and is not intended to recommend any provider over another.
To put it quite simply: IT teams can make better cloud decisions by understanding the specifics of cloud provider network behaviors and anomalies.
The Cloud Performance Report shines a light on how providers manage their networks and what good (and “not so good”) network performance looks like, helping to answer critical strategic and tactical questions, such as:
- What visibility points will help me know what performance and quality I’m getting with my cloud services?
- How can I confidently plan the deployment of resilient cloud-based application stacks?
- How can I best optimize my applications given my provider’s specific connectivity and behaviors?
- What questions should I ask my cloud provider to help ensure good performance and to scale as my needs grow?
State of Today’s Cloud Landscape
Cloud services have become ubiquitous for enterprises today, with public cloud networks a critical piece of their day-to-day infrastructure. Business SaaS has seen widespread adoption by enterprises, adding additional layers of complexity for ITOps teams troubleshooting problems. This complex web of interdependencies can make it difficult for IT teams to establish reliable evaluation, monitoring, and optimization strategies that are specific to their own applications. Another factor is the considerable centralization of cloud service usage. When a cloud outage occurs, it can have wide-ranging impacts and be felt by many. And yet, it can still be incredibly difficult for ITOps or Site Reliability Engineer (SRE) teams to efficiently pinpoint the cause. Further complicating matters is organizations’ frequent use of multiple public clouds or a mix of public and private clouds.
IT architects need to know what they can expect from their cloud provider. While no provider is immune to anomalies, the question remains: where do these performance anomalies exist, and why? The typical deployment today of modular and distributed cloud-based digital services necessitates knowing the behavior and connectivity from many different angles. Architects invariably need to know answers to such questions as:
- What does the cloud provider’s network connectivity look like, and how does it perform for my scenario?
- How well-peered is my provider with other cloud and transit providers?
- How well does my provider perform between its cloud regions and between its availability zones?
Here are some highlights from our findings, the full details of which can be found in the Cloud Performance Report.
Finding 1: Connectivity architecture decisions made by cloud providers can have performance and operational impacts for their customers. Connectivity architectures varied between the three major cloud providers. This includes differences in how they advertise service endpoints, how they obscure underlay paths, and how they leverage shared infrastructure for their backbone. These differences can have meaningful impacts for their customers.
Finding 2: Cloud regions in mature markets see good backbone performance, whereas other regions of the world, such as Asia, saw more issues. All three providers made notable optimizations to their backbone performance since 2019, although we continued to see significant latency variations.
Finding 3: All cloud providers experience performance challenges with traffic originating from users within mainland China. This is due to the performance hit from traversing China’s Great Firewall, which results in packet loss and higher latency. Hong Kong still appeared to be outside of China’s Great Firewall, although packet loss to Hong Kong did increase significantly starting in 2021.
Finding 4: Inter-AZ performance was very good for all three providers, with most regions showing latency well below the desired 2-millisecond threshold. Notably, some providers were more consistently under that threshold than others.
Finding 5: Traffic flowing between the major cloud providers was typically handed off directly, bypassing the Internet. This finding speaks to how well-connected the major cloud providers are to each other, and how, in some cases, cloud-to-cloud performance rivaled intra-cloud performance for similarly located regions.
Read the Cloud Performance Report to learn more.
Architecting for Performance in the Cloud
Cloud performance matters because today’s application designs have become so reliant on it. Modular app stacks have made low latency a must-have. Cloud is at the center of the web of distributed application interdependencies, microservices, and SaaS APIs that drive digital services. On top of this, architects strive to design services to be highly available, resilient, and cost efficient. High availability goals drive multi-instance load-balanced application stacks, geo-redundant data replication designs, and multi-region architectures.
Cloud network performance, therefore, is not measured by a single metric but by looking at several different data points collected from multiple perspectives. Inter-region connectivity, for example, can have wide variations depending on the network metric, provider, and geography. Knowing the performance of the relevant interconnections is paramount when planning new application deployments.
The data set used in the Cloud Performance Report includes metrics of loss, latency, jitter, MTU, and forward and reverse path topology data for end-user, inter-region, inter-availability zone, and multi-cloud measurements. These four categories of measurements comprise the different use cases that affect consumers and operators of cloud-based applications.
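As an illustration of how these metrics relate to one another, the sketch below summarizes a round of probes into loss, average latency, and jitter. This is not ThousandEyes’ implementation; it is a minimal example in which jitter is computed as the mean absolute difference between consecutive round-trip times, and a `None` sample marks a lost probe.

```python
from statistics import mean

def summarize_probes(rtts_ms):
    """Summarize one round of network probes.

    rtts_ms: ordered round-trip times in milliseconds; None marks a lost probe.
    Returns (loss_pct, avg_latency_ms, jitter_ms). Jitter is approximated as
    the mean absolute difference between consecutive received RTTs.
    """
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    if not received:
        return loss_pct, None, None
    avg_latency = mean(received)
    if len(received) > 1:
        jitter = mean(abs(b - a) for a, b in zip(received, received[1:]))
    else:
        jitter = 0.0
    return loss_pct, avg_latency, jitter

# Ten probes, one lost: 10% loss, ~2.01 ms average latency, ~0.34 ms jitter
samples = [1.8, 2.1, None, 1.9, 2.4, 2.0, 1.7, 2.2, 1.9, 2.1]
loss, latency, jitter = summarize_probes(samples)
```

Path topology and MTU data come from separate probing techniques (hop-by-hop path discovery and path MTU discovery) and complement these per-probe statistics.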
The public Internet can play a big role in cloud-based application performance. End-user measurements are designed to give customers of cloud IaaS and platform services insight into how different cloud provider locations are connected to the broader Internet and how the end-to-end paths perform for different locations.
Architects deploying new services will have questions that can be addressed by this visibility, such as: how long does traffic stay on the public Internet before it enters the cloud provider network, and do longer Internet paths affect overall performance? Cloud providers are constantly improving their backbones and their peering, but there are regional differences in performance among the major cloud providers. Knowing these details can help inform successful application planning and deployment decisions.
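One way to reason about “how long traffic stays on the public Internet” is to map each hop of a forward path to its autonomous system (AS) and count the hops before the provider’s ASN appears. The sketch below assumes you already have a per-hop ASN list (e.g., from a traceroute enriched with AS lookups); the specific ASNs in the example are illustrative.

```python
def internet_hop_count(path_asns, provider_asn):
    """Count hops traversed before traffic enters the cloud provider's network.

    path_asns: ordered ASN per hop (None for unresponsive hops).
    provider_asn: the ASN of the destination cloud provider.
    Returns the number of hops on the public Internet before the first
    provider-network hop, or the full path length if it never entered.
    """
    for i, asn in enumerate(path_asns):
        if asn == provider_asn:
            return i
    return len(path_asns)

# Illustrative forward path: local network -> transit provider -> cloud backbone
path = [64512, 64512, 3356, 3356, 16509, 16509, 16509]
hops_on_internet = internet_hop_count(path, provider_asn=16509)
```

Comparing this hop count (and the latency accumulated over those hops) across vantage points shows where a provider ingests traffic close to the user versus carrying it over the public Internet for longer.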
Inter-availability zone (AZ) measurements were collected across all three analyzed cloud providers. Multi-zone cloud application architecture is typically used for resiliency. Application architects will typically deploy their app stacks in highly redundant, load-balanced designs split across different physical availability zones. If one AZ sees a failure, the application can remain available. For example, a typical active-active application design may include multiple instances of the same application stack split across different availability zones, with data synchronization occurring in real-time between the instances. In this scenario, every millisecond counts because latency can stack up over the course of an application session.
Providers commonly strive for response times below 2 milliseconds between zones, but there can be variability. ThousandEyes data was used to gauge the frequency and nature of such variability or anomalies across each analyzed cloud provider. Variability was seen not only in the latency numbers themselves, but also in characteristics like frequency and duration.
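Characterizing that variability amounts to finding runs of consecutive samples above the threshold in a latency time series: each run is one excursion, the number of runs gives frequency, and each run’s length gives duration. A minimal sketch, assuming evenly spaced latency samples in milliseconds:

```python
def latency_excursions(samples_ms, threshold_ms=2.0):
    """Find runs of consecutive samples above a latency threshold.

    Returns a list of (start_index, run_length) pairs, one per excursion.
    The count of pairs gives anomaly frequency; each run_length gives
    duration in sample intervals.
    """
    runs, start = [], None
    for i, value in enumerate(samples_ms):
        if value > threshold_ms and start is None:
            start = i                      # excursion begins
        elif value <= threshold_ms and start is not None:
            runs.append((start, i - start))  # excursion ends
            start = None
    if start is not None:                  # excursion runs to end of series
        runs.append((start, len(samples_ms) - start))
    return runs

# Two excursions above 2 ms: one lasting two samples, one lasting one
series = [1.6, 1.7, 2.4, 2.6, 1.8, 1.9, 3.1, 1.7]
excursions = latency_excursions(series)
```

Comparing excursion counts and durations across providers and region pairs is one concrete way to make the “some providers were more consistently under the threshold” observation measurable.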
Multi-region application architectures are primarily used to address latency. In other words, deploying applications and content closer to the user improves the user’s experience of that application. Placing back-end services closer to front-end services, and synchronizing the data between regions, reduces application latency.
Organizations can have other valid business policy reasons for using multi-region connectivity beyond technical use cases. For example, there may be a requirement to deploy active-standby geographically redundant application pods, or there may be a need to store customer data in one geographic region but not in another.
Cloud provider backbone performance is critical in these scenarios. Our analysis found that cloud regions in more mature markets saw reliable backbone performance, whereas other regions (notably, those in Asia and Oceania) were less reliable. The data also showed that the cloud providers made optimizations in different regions over the three-year period and that latency fluctuations remained frequent.
Modern applications today often rely on multiple public or private clouds, either by purposeful design or as a result of third-party service dependencies sitting in different cloud provider networks. Applications using modular frameworks are API-centric, meaning that API-to-API communications are a typical operation in an application flow. If an API in one cloud provider is talking with an API in another cloud provider, it’s important to know what that network connectivity looks like and how well it performs.
When planning deployments using cloud services, teams may need to know if one pair of cloud providers has better interconnectivity than another pair of cloud providers for their specific locations or if inter-region latencies between different providers meet their needs. Our data reveals that traffic going from one cloud provider to another is typically handed off directly without traversing the public Internet, showing how well-peered the major cloud providers are. This interconnectivity can provide performance benefits for multi-cloud traffic.
Our analysis of the collected cloud data has revealed three key insights that Infrastructure and Operations (I&O) professionals should be mindful of when planning and managing cloud deployments and dependencies.
Performance issues with cloud services are not uncommon. Cloud providers are constantly working to scale their presence and expand global capabilities. Routine maintenance is a constant fact of life, and no provider is immune to issues. While major outages make headlines, the more frequent, smaller-scale performance and availability issues can be difficult to catch and identify, yet they can considerably impact user experience. Being ready for issues of all sizes should be part of every team’s cloud management strategy.
Cloud providers manage their networks based on their own priorities and preferences. How a cloud provider designs and scales its network may not align with every customer’s use case. Providers will vary in how they optimize and prioritize traffic across what are often shared networks. IT leaders need to know where they stand in relation to these preferences and prioritizations and whether they could be impacted.
There is no steady state in the cloud. Cloud networks are in constant flux as providers strive to scale and expand their infrastructure and add new locations, services, and connectivity options. How one region performed for a provider one year may differ considerably the next. Knowing that these networks are dynamic and ever-changing helps to inform operational strategy. Likewise, performance snapshots may not reflect current conditions, so having persistent and ongoing visibility is critical.