Race to Cloud Domination: AWS, Azure, GCP, IBM & Alibaba Cloud Compared
One year ago, we launched the first-ever cloud performance report to answer this key question: how can you use performance data to make better cloud decisions?
We know that the cloud providers are in a race for market domination, and we know that there is no steady state in the cloud. So it only made sense for us to follow up and rerun this research. But we still want to answer this key question: in their race to cloud domination, how can you win in the cloud?
Good morning, everybody. My name is Archana Kesavan. I'm the lead author and report researcher for the second cloud performance benchmark. It gives me immense pleasure to be here representing ThousandEyes and presenting the insights of this report, which is the first-ever vendor-agnostic, data-based research, not survey-based, but data-based, that compares the network performance and connectivity architectures of the top five cloud providers.
Our journey started in 2018 and we noticed that there was a glaring gap when it came to performance comparisons of cloud providers. So we wanted to change that and we did an analysis of AWS, GCP, and Microsoft Azure in terms of network performance and connectivity architectures.
Now, a year goes by and we're still on that journey. And we're looking at performance this year from three vectors. Change-- there is no steady state in the cloud. We can now look at year-over-year change and more importantly, find out what triggered those changes.
Coverage-- we've expanded our data measurements to cover vantage points that matter to you the most. We're looking deeply at China because we know it's a challenging geography for doing online business.
We're also looking at broadband ISP performance as those ISPs connect to the cloud, because as enterprises move to SD-WAN and hybrid WAN, that becomes critical to know. And then we're looking at the AWS Global Accelerator, a new service that AWS introduced last year that promises to improve the performance and availability of your AWS workloads.
With multi-cloud, we know you have a lot of options to pick from. So we're adding Alibaba Cloud and IBM Cloud into our research this year as well, making this the most comprehensive cloud performance report to date.
We gathered this data using ThousandEyes' data infrastructure and analysis engine. ThousandEyes vantage points are distributed globally in about 200 cities around the world and within the cloud provider environments.
Our agents continuously talk to each other, emulating user traffic through TCP flows, and every 10 minutes we measure network performance, which includes loss, latency, and jitter.
Now, because we have vantage points on both ends, we can look at performance directionally. We can look at the forward path as well as the reverse path, which is extremely critical because network paths are rarely symmetric.
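To make that kind of periodic measurement concrete, here is a rough, hypothetical sketch, not the actual ThousandEyes agent: it times repeated TCP connections to an assumed endpoint every 10 minutes and derives loss, average latency, and a crude jitter figure from the samples.

```python
# Illustrative sketch only -- not the ThousandEyes agent. It approximates loss,
# latency, and jitter by timing repeated TCP connection setups to a target.
import socket
import statistics
import time

TARGET = ("example.com", 443)    # hypothetical endpoint under test
PROBES_PER_ROUND = 10
ROUND_INTERVAL_S = 600           # one measurement round every 10 minutes

def probe_once(target, timeout=2.0):
    """Return TCP connect time in milliseconds, or None if the probe is lost."""
    start = time.monotonic()
    try:
        with socket.create_connection(target, timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def measurement_round(target):
    samples = [probe_once(target) for _ in range(PROBES_PER_ROUND)]
    ok = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(ok)) / len(samples)
    latency = statistics.mean(ok) if ok else float("nan")
    jitter = statistics.pstdev(ok) if len(ok) > 1 else 0.0   # crude jitter proxy
    return loss_pct, latency, jitter

while True:
    loss, latency, jitter = measurement_round(TARGET)
    print(f"loss={loss:.0f}%  latency={latency:.1f} ms  jitter={jitter:.1f} ms")
    time.sleep(ROUND_INTERVAL_S)
```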
We're also using path visualization, which gives us detailed insight into network topology and connectivity architectures. And as you're going to see, it holds the key to a lot of the questions we're about to answer.
Now, this year, the research scope was extensive. We started from end-user measurements and went all the way to multi-cloud connectivity. So we're measuring everything to the cloud, from the cloud, and within clouds.
Let's look at end user measurements. End user measurements and how your users consume a cloud service is of utmost importance. So what we're doing here is, from 98 global locations, we are monitoring to 95 regions across all five cloud providers. We mimic this same setup with broadband ISP performance. We test from six different cities in the United States to the North American hosting centers of the five cloud providers.
Now this gets you to the front door of the service, right? We know applications are microservice-based and distributed, and are hosted in multiple availability zones or multiple regions within your cloud providers. So looking at inter-AZ performance and inter-region performance along with multicloud connectivity is absolutely critical.
All those measurements, all that data collected over 30 days at 10-minute intervals, added up to a lot of data, over 320 million data points, and it left us with plenty of questions.
For instance, does Alibaba Cloud get a free ride through the Great Firewall? Is the AWS Global Accelerator always outperforming the internet? Or, what are the key areas where performance anomalies exist and why? But let's start with the first question today, which is, are all cloud backbones created equal?
Now, when you think about cloud backbones, there are two ways we should look at it. One is communication within the cloud, say inter-AZ or inter-region performance; the other is communication from users. How does user traffic get to cloud services? How does the backbone get used there, and is it used the same way across providers?
We're going to start with inter-region measurements. Now, when we did this research for the first time last year, we found out that AWS, Azure, and GCP extensively used their own backbone for inter-region communication.
So to build a comparison, we took inter-region latencies, compared them to internet measurements between those exact same geographic regions but outside the provider's backbone, and then grouped inter-region pairs, as you can see here, into those doing better than the internet and those doing worse than the internet.
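As a minimal sketch of that grouping step, assuming made-up latency numbers purely for illustration:

```python
# Hypothetical data: median latency (ms) for each region pair, measured once
# across the provider's backbone and once across the public internet.
pairs = {
    ("us-east", "eu-west"):  {"backbone": 78.0,  "internet": 85.0},
    ("ap-south", "ap-east"): {"backbone": 96.0,  "internet": 71.0},
    ("us-west", "ap-south"): {"backbone": 212.0, "internet": 230.0},
}

better, worse = [], []
for pair, lat in pairs.items():
    (better if lat["backbone"] <= lat["internet"] else worse).append(pair)

total = len(pairs)
print(f"better than internet: {100 * len(better) / total:.0f}% of pairs")
print(f"worse than internet:  {100 * len(worse) / total:.0f}% of pairs")
```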
Now, if you focus on the three vertical bars right in the middle, which are AWS, Azure, and GCP, you will see that about 90% of their region pairs do really well. But let's look at IBM. IBM does exceptionally well here: 97% of their inter-region pairs perform better than the internet, and only 3% don't perform as well as the internet.
If you move over to the right and look at Alibaba Cloud, the picture is the opposite of IBM's. 15% of Alibaba Cloud region pairs do not perform as well as the internet; they perform worse. We had to ask why.
And when we dug into it, we found out that Alibaba Cloud is the only cloud provider that does not extensively use its own backbone for inter-region communication. What does that mean? It means the inter-region communication path has the internet right in the middle. The internet is connecting the various regions of Alibaba Cloud in many cases.
As an enterprise, if you're hosting applications within this cloud provider and they are dispersed across regions, this becomes critical to know, because it can impact performance and you want to know which networks your traffic is going through.
Let's switch to inter-AZ performance. We measured inter-AZ performance last year across AWS, GCP, and Azure, and we found that they actually hold up really well to their claims. These three providers claim sub-2-millisecond inter-availability-zone latencies, and it turns out they deliver. And year over year, things just got better, with GCP leading the way.
How did IBM and Alibaba Cloud do? Well, not quite in the same class as the other three, but still excellent performance: 1.68 milliseconds and 1.22 milliseconds between availability zones.
Now, availability zones are constructed so there is redundancy and fault tolerance within a region. If one availability zone fails, another can take over, so your application stays alive and users see no disruption.
When we looked at this data, they were all below 2 milliseconds, but a couple of data points stood out for us. We noticed really low latency numbers in two regions. Each of the circles you see there filled with a latency number is a measurement from that particular region.
We noticed that GCP, in one of its regions, had an inter-AZ latency of 0.37 milliseconds, and Alibaba Cloud had 0.27 milliseconds. Those were really low numbers, so low that they prompted some questions. How are availability zones constructed within a provider? Are they physically independent, or are they just different floors in a data center?
So while these latency numbers are good, they do raise the question of whether availability zones are constructed differently across these providers. So as an enterprise using availability zones and multiple regions, ask your cloud provider how they are built, because assuming that an availability zone from AWS is the same as an availability zone from another provider is probably not right.
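To see why those numbers prompted the question, here is a rough back-of-the-envelope check of our own, assuming light travels through fiber at roughly 200 km per millisecond and ignoring all switching and queuing delay:

```python
# Back-of-envelope: what physical separation could such a small round trip allow?
FIBER_KM_PER_MS = 200.0   # roughly 2/3 of c, i.e. about 200 km of fiber per millisecond

for provider, rtt_ms in {"GCP": 0.37, "Alibaba Cloud": 0.27}.items():
    one_way_ms = rtt_ms / 2
    max_fiber_km = one_way_ms * FIBER_KM_PER_MS
    print(f"{provider}: {rtt_ms} ms RTT -> at most ~{max_fiber_km:.0f} km of fiber one way")

# GCP's 0.37 ms RTT allows at most ~37 km of fiber between the zones, and real
# switching overhead means the facilities are likely even closer than that bound.
```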
So far we've been talking about traffic within a cloud provider: inter-region and inter-availability zone. But let's now shift and look at how user traffic gets to these regions.
Now, last year, we found out there are two distinct ways by which user traffic is ingested into a cloud provider's backbone. The first is an internet intensive way. What does that mean? It means user traffic rides the internet longer till it gets into the backbone of a cloud provider.
So for instance, if you have a user in Frankfurt accessing a service in Singapore, in an internet-intensive architecture that traffic is going to ride the internet to Hong Kong and only then go into the cloud provider's backbone.
The other method is a backbone friendly architecture where user traffic goes into the provider backbone closest to the user. So what does that mean? It rides the internet less, but it rides the backbone of a cloud provider longer.
These are two different types of connectivity, but across providers we've seen three different approaches. We found out last year, and it still holds true, that Microsoft Azure and GCP follow a backbone-friendly architecture for ingesting user traffic, which means they use their backbone more. AWS and Alibaba Cloud are in the internet-intensive camp.
What about IBM? It turns out IBM is hybrid: some regions of IBM Cloud are backbone friendly, for instance the Washington, D.C. region, while others, such as the Chennai, India region, are internet intensive.
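One simplified way to reason about the distinction, our own heuristic rather than how the report classified providers, is to look at a traced path and see how many hops traffic takes on the public internet before the destination provider's autonomous system shows up:

```python
# Simplified heuristic: classify a traced path by how far traffic travels on the
# public internet before it enters the cloud provider's own network (ASN).
# The hop lists and ASNs below are made up for illustration (documentation range).

def classify(path_asns, provider_asn, max_internet_hops=3):
    """Return 'backbone friendly' if the provider ASN appears within a few hops
    of the source, otherwise 'internet intensive'."""
    for hop_index, asn in enumerate(path_asns):
        if asn == provider_asn:
            return "backbone friendly" if hop_index <= max_internet_hops else "internet intensive"
    return "never entered provider backbone"

PROVIDER_ASN = 64500  # hypothetical cloud provider ASN

# Frankfurt user whose traffic enters the backbone near Frankfurt:
print(classify([64496, 64497, PROVIDER_ASN, PROVIDER_ASN], PROVIDER_ASN))
# Frankfurt user whose traffic rides transit ISPs most of the way to the region:
print(classify([64496, 64497, 64498, 64499, 64501, 64502, PROVIDER_ASN], PROVIDER_ASN))
```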
But I think the most important question we need to answer is: what is the impact of these architectures on performance? We know that the internet was not built for enterprise communication. We know that the internet has no SLAs, and by its very definition performance is best effort. So does that mean an internet-intensive cloud provider is going to see variability in performance? Well, let's find out.
Last year, we noticed that AWS's internet-intensive architecture resulted in a lot more variability in some parts of the world. What you're looking at here is network latency, measured bidirectionally from the user locations on the horizontal axis, as users from those locations connect to each cloud provider's region in Mumbai, India.
So if you focus on Asia, you notice that AWS's latency there is much higher than the other two providers'. But notice that black vertical line within each bar. That denotes the deviation in latency, meaning how much latency swung during our data collection period.
Another way to look at it is, how accurately or consistently could you predict that performance would hit the average latency number? The longer the line, the lower the predictability.
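Here's a quick sketch of what that deviation bar represents, using made-up latency samples for two hypothetical providers:

```python
# Made-up bidirectional latency samples (ms) from one user location to a
# Mumbai-hosted region, collected over the measurement window.
import statistics

samples = {
    "provider A": [230, 228, 231, 229, 232, 230],   # tight swing
    "provider B": [205, 260, 190, 245, 300, 210],   # wide swing
}

for name, lat in samples.items():
    mean = statistics.mean(lat)
    dev = statistics.stdev(lat)   # the "black line": how far latency swings
    print(f"{name}: mean={mean:.0f} ms, deviation={dev:.0f} ms")

# Provider B's higher deviation means its latency is far less predictable,
# even if its mean looks acceptable.
```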
AWS has the longest line there, meaning the lowest predictability, especially in Asia. Why do we see this especially in Asia? Because the quality of the internet there is not as good as in the rest of the world. So we saw variability.
And we said, OK, can this be improved? Is this going to be improved, or is this simply the state of AWS performance in Asia? It turns out it can be improved. What you're looking at now is a comparison of the 2018 and 2019 numbers: the same metric measured from the exact same vantage points to the exact same hosting region.
AWS has improved in two ways. One, latency from those vantage points in Asia has come down; and two, that black line we were talking about, the variation, has gotten shorter as well. So the question is: yes, it can be improved, but why? So we looked at the path visualization.
What you see here is the network connectivity from users in two locations in Asia, Seoul and Singapore, connecting to AWS's region in Mumbai, India. As you watch traffic go from Seoul to Mumbai, you see it making a trip all around the world: traffic goes to New York and then heads to Mumbai.
No wonder latency for AWS in Asia was so high last year; that was not an optimized route. And look at how long traffic spent on the internet versus how much time it spent on the AWS backbone. Of course there is going to be variability.
When we looked at these exact same locations, this year we noticed a change. We noticed that routing was much more optimized. We noticed the introduction of a few peering locations and co-locations here, which showed that AWS has been improving and influencing how users get into their backbone.
Yes, the internet is still a risk for doing business, but it can get better, which brings us to the next question. We know the internet is risky. We know that there is variability on the internet. But does that mean backbone-friendly cloud providers are a panacea for performance issues? Does that mean Azure and GCP deployments are going to be safe and see no performance anomalies? That's not true either.
We noticed in 2018 that there was a glaring latency divide among AWS, GCP, and Azure. And as you can see, it takes three times longer for users in Europe to get to Mumbai on GCP compared to AWS and Azure.
Now, we dug a little further into it and found out why this was happening. We all know that GCP is a strong proponent of using its backbone for moving user traffic, but last year we noticed that Google's backbone had no direct route from India to Europe or Europe to India. So the only way GCP got around the problem was literally to go around it, which explains the 3x network latency.
Now, when we started this research earlier in the year, we were really curious to see what had changed, because year-over-year change was one of the goals of this research. And we noticed not much had changed.
We know Google is making improvements to its backbone and its infrastructure. However, the effects have not shown up broadly yet: our vantage points in those exact same locations in Europe did not see an improvement.
Let's look at Azure. Azure did really well last year. They used their own backbone, and they did well not just in terms of network latency but also in terms of variation in latency. So we wanted to understand the year-over-year change, and it looks like a mixed bag.
What you're looking at here is latency variation, that swing we were talking about, which reflects how closely and accurately you can predict your performance. The longer the bar, the worse the predictability, so you want it to be as short as possible.
As you see here for Azure's Sydney region, predictability improved by 50%. For India, however, it did not; predictability decreased by 30%.
So what have we learned so far? Improvements can be made to an internet-intensive architecture. Architectural gaps in cloud provider backbones can influence latency and performance, and architectural changes made in the backbone can be a mixed bag.
With architecture in mind, let's shift focus to one of the most internet-sovereign areas in the world, China, and ask how the Great Firewall impacts performance.
So when we added Alibaba Cloud to this report this year, one of the questions we wanted to answer was, does Alibaba get a free pass through the Great Firewall? So we measured that, looking at performance in terms of network latency and packet loss in this particular instance.
Network latencies follow the speed of light: if you're going from China to a hosting region, your latency is around the range we would expect across these providers. Packet loss was the interesting part. We noticed that irrespective of the cloud provider, traffic to and from China will be impacted by packet loss, which will impact performance.
So what you see here is percentage packet loss from different regions around the world when traffic is going to a hosting location in India across those five providers, and you notice that any traffic coming from China suffers packet loss.
What we also did in this particular instance is measure performance from our vantage points in China to Alibaba's hosting region in China, and we did not see any packet loss. That means as long as you're within the boundaries of the Great Firewall, you will not see a hit in performance. But the minute you cross that boundary, you are going to be hit with packet loss and a performance penalty. So no, Alibaba Cloud does not get a free pass through the Great Firewall.
And the reason this happens is that China is an island when it comes to internet connectivity. When traffic from China goes out to other providers, the exchange does not happen before the firewall; it always happens after the firewall, irrespective of the provider. So the performance impact is essentially provider-agnostic.
However, what if you did not want to host in China for some reason? What if you still have to serve your Chinese customers, but you're reluctant to pick a hosting region in China? We looked at performance to see which regions could be viable options.
Turns out Hong Kong is a good option. If you take a look at it, Alibaba Cloud and Azure both do comparatively well there in terms of network latency. You are still going to see packet loss, but on latency they're close to each other. Singapore is another good region; we noticed that Alibaba Cloud led there, followed by Azure and AWS.
Now, we've been looking at performance in Asia, and if you look into the report, you're going to see a lot of anomalies in Latin America too. So the question we wanted to answer next was: are there any mature markets when it comes to cloud provider performance? What are they, and are they free from performance anomalies?
Our 2018 and 2019 research showed us that North America and Western Europe are pretty good markets for cloud performance. All five cloud providers performed really well there, and those markets are mature. But are they free from performance anomalies? That's what we're going to look at next.
We're going to shift focus a little bit and look at performance from broadband ISPs connecting to cloud providers. We've been looking at end users coming into cloud providers; now we're going to look at users from vantage points hosted in various broadband ISPs.
Now, if you take a look at this graph, you're looking at performance from those cities in the US, from AT&T all the way to Verizon, connecting to Azure's hosting region in the East. They're all comparable. Sure, there is a difference of a few milliseconds between these providers, but it's negligible and not alarming.
But does this remain the same across the US, across broadband providers, across cloud providers, across cloud regions? The answer is, for the most part, yes. However, there is an anomaly, an exception.
We noticed that traffic from our vantage points hosted on Verizon going to GCP's location in LA takes about 60 milliseconds round trip. So Verizon agents in LA, Seattle, or San Jose take about 60 milliseconds to stay within the same coast. That is a really big number, and you can see that glaring anomaly right there.
We, again, had to ask why. It turns out it was a simple routing anomaly: traffic from Verizon-hosted vantage points was handed over to Google's network all the way in New Jersey, only to make the trip back to LA. No wonder we were seeing 60 milliseconds.
This issue lasted for about a couple of months at the time we collected this data, but it doesn't exist anymore. So what does this tell us? It tells us that even the most mature markets, the most mature cloud providers and broadband providers, can run into anomalies, and those anomalies can be short-lived.
But if you don't have the data to know what good looks like, and the data to understand deviation from good, then you're running blind in the cloud.
Finally, no cloud performance report would be complete if we did not look at the AWS Global Accelerator, a service that was specifically designed to improve the performance and availability of AWS workloads.
Last year, we launched our report around the same time, in about November. And two weeks after our report, at re:Invent, AWS launched a new service for performance improvement. No coincidence there.
So we had to compare it. We had to look at performance with and without the Global Accelerator to see if there was any difference. We looked at network paths, and we looked at loss, latency, and jitter across both setups. And what did we find?
We found that the Global Accelerator is not a magic pill for performance improvement. In other words, your mileage may vary depending on where your users are coming from, which network they're on, and which AWS region they're connecting to. You could see a difference in performance either way.
For instance, this is an example of using the Global Accelerator with a workload in Ashburn. Look at the user from Seoul. With the Accelerator, if you're connecting from Seoul, you're going to see a latency improvement of 80 milliseconds. That's enormous. So the Global Accelerator works well for Seoul.
Now look at Bangalore. Your latency increases by 100 milliseconds, so it's probably not the best option if your users are in Bangalore. How about San Francisco? For San Francisco, the improvement is negligible.
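Here's a small sketch of that with/without comparison. The deltas mirror the examples above, but the baseline "direct" latencies are purely hypothetical, chosen only to make the arithmetic concrete:

```python
# Illustrative comparison of median latency (ms) to an Ashburn-hosted workload
# with and without the Global Accelerator. Baselines are invented; only the
# deltas echo the talk's Seoul / Bangalore / San Francisco examples.
measurements = {
    "Seoul":         {"direct": 260.0, "accelerator": 180.0},
    "Bangalore":     {"direct": 220.0, "accelerator": 320.0},
    "San Francisco": {"direct": 70.0,  "accelerator": 69.0},
}

for city, m in measurements.items():
    delta = m["direct"] - m["accelerator"]            # positive = accelerator helped
    pct = 100.0 * delta / m["direct"]
    verdict = "improves" if delta > 1 else ("hurts" if delta < -1 else "is negligible")
    print(f"{city}: {delta:+.0f} ms ({pct:+.0f}%), accelerator {verdict}")
```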
We need to remember that while the Global Accelerator steers traffic onto a different path through AWS's backbone, latency cannot defy the speed of light. You can't change physics. So the latency improvement you can get is bounded.
And in the case of San Francisco, we did not see an improvement in jitter either. So the question is, how do you know? If you don't have data, how can you validate a service like this before you deploy it?
Now, again, the performance of the Global Accelerator, performance of any cloud service depends on where your users are coming from, which network they're coming from. So this is just a snapshot.
In the report, you will find this exact data set, comparing performance from 38 global user locations to five AWS regions. And you're going to see the latency improvement in absolute terms as well as in percentages.
So coming back to the question we started with: how do you win in the cloud amidst this race to cloud domination? There are three ways. First, start with data, not assumptions. There is no simple formula for performance in the cloud. While the report produces a wealth of data, we're just scratching the surface.
You need to look at this data through your own lens, because every cloud deployment is a snowflake. You need data from the vantage points that are the most useful for you. Use the report as a starting point, but use data, not assumptions, for your deployments.
Number 2-- embrace the readiness lifecycle. There is no steady state in the cloud, and we saw that. We saw how things changed for the better and how things changed for the worse. We saw that the Accelerator may or may not make a performance difference. But if you know that ahead of time, if you have that visibility, you can operate better in the cloud.
And then finally, drive accountability with your providers, because as we saw, there can be anomalies in mature markets, and there can be recurring anomalies in not-so-mature markets.
And if you have that visibility and data, you can collaborate with your cloud providers and help them make things better for you because it is a competitive market and it's in their best interest to make things better for the consumer. So use data to drive accountability.
So going back to our question: how do you win in the cloud? This is how you win in the cloud, and this is how you thrive in a connected world. Thank you.