The Internet Report

Ep. 14: India Swipes Left on TikTok, GCP Outage Hits Multiple AZs, & Cloud Networking 101 for Enterprises

By Angelique Medina

Summary


Watch on YouTube - The Internet Report - Ep. 14: June 29 – July 5, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet, and why. On this week’s episode, we cover a recent move by the government of India to ban many Chinese-owned applications, including TikTok, which reportedly has more than 600 million downloads in India. We also talk through a service disruption at Google Cloud Platform on June 29th that affected multiple availability zones within a single region, and we briefly cover outages at Slack and Comcast, too. Google shared its Incident Report on this event after we recorded this week’s episode, so check that out for the latest info. After our review of this week’s highlights, I sat down with Atif Khan, CTO of Alkira and co-founder of Viptela, to talk enterprise cloud strategy.

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.


Follow Along with the Transcript

Angelique Medina:
This is the Internet Report, where we uncover what's working and what's breaking on the Internet and why. I'm Angelique Medina, and I'm here with my co-host, Archana Kesavan.

Archana Kesavan:
Hey guys.

Angelique Medina:
So, lots of interesting stuff happened last week. So, there was the tension between China and India and this happened around the same time in which apparently India decided to ban around 59 Chinese-built apps among which was TikTok.

Archana Kesavan:
Right.

Archana Kesavan:
There was a military dispute wherein about 20 Indian soldiers were killed sometime in the middle of June. And following that, India decided to ban about 59 Chinese applications. And like you mentioned, TikTok was one of them-

Angelique Medina:
But there were, or have been, a lot of users in India using TikTok. Right?

Archana Kesavan:
Right.

Angelique Medina:
Something like more than 600 million downloads of the app. And then of course there was also other TikTok news that was also related to the political situation. And that was TikTok has pulled out of Hong Kong. TikTok is a Chinese company but it's not available to users in Mainland China and now it's no longer available to users in Hong Kong. So some interesting news about how kind of the political situations in various countries are impacting what's available to users from content consumption standpoint.

Archana Kesavan:
Right. And you really might want to keep an eye on how the US is going to handle TikTok, as well, in the upcoming weeks. I saw something earlier this morning that indicated that the US is also going to think about shutting down TikTok because of growing concerns. So stay tuned for that.

Angelique Medina:
And then of course, I'm sure a lot of folks heard about the almost two-and-a-half-hour GCP outage in us-east1. A couple of sites were impacted.

Archana Kesavan:
us-east1 is the South Carolina region. It's not necessarily the most popular region; Ashburn still is. But us-east1 especially had services hosted in availability zones... Two of the availability zones, which interestingly is almost 66% of that region. That particular region only has three availability zones and two of them were out for more than two and a half hours.

Angelique Medina:
So that was B and C that went out.

Archana Kesavan:
C and D that went down.

Angelique Medina:
B and D. And what's interesting is that... I mean the whole point of availability zones is that there's some separation in terms of whether it be connectivity or power. And when they later reported that it was a power outage that took this down that was a little bit surprising because it suggests that they actually had shared resources and maybe redundancy is not... Maybe that's something you need to factor in or think about if you're building out redundant architecture.
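To make the redundancy point concrete, here is a minimal Python sketch of how much capacity a deployment keeps when some zones fail. The zone names and the six-instance layout are hypothetical, chosen only to mirror the "two out of three zones" situation discussed above; nothing here reflects how any particular customer was actually deployed.

```python
def surviving_fraction(instance_zones, failed_zones):
    """Return the fraction of instances that sit outside the failed zones."""
    if not instance_zones:
        raise ValueError("need at least one instance")
    failed = set(failed_zones)
    alive = sum(1 for zone in instance_zones if zone not in failed)
    return alive / len(instance_zones)


# Hypothetical deployment: six instances spread evenly over three zones.
deployment = ["zone-1", "zone-1", "zone-2", "zone-2", "zone-3", "zone-3"]

# Losing two of the three zones leaves only the instances in the remaining zone.
remaining = surviving_fraction(deployment, ["zone-2", "zone-3"])
print(f"capacity remaining: {remaining:.0%}")
```

An even spread across zones only helps if the zones fail independently; if they share power or other resources, several can go down at once, and a second region may be the safer unit of redundancy.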

Archana Kesavan:
Right. I think the whole concept of availability zones, and how every provider lays them out, can be unique. And one of the things that we discovered in our annual Cloud Performance Benchmark Report, the 2019 report, is that there was some variation in terms of inter-availability zone latencies, and in some regions in GCP or AliCloud, Alibaba Cloud, they were really small. So it always raises that question when you think about availability zones: how distinct are they from a resource consumption perspective? Are they physically separate, or are they just different floors within a data center? Right?
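One rough way to get a feel for inter-availability zone latency yourself is to time TCP handshakes from a VM in one zone to a listener in another. The Python sketch below is only an illustration under that assumption; it is not the methodology of the benchmark report mentioned above, and the function names are made up for this example.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Time `samples` TCP handshakes to host:port; return each in milliseconds."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the TCP handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        results.append((time.perf_counter() - start) * 1000.0)
    return results


def summarize(latencies_ms):
    """Reduce a list of samples to min, median and max."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }
```

Run from a VM in one zone, a call such as `summarize(tcp_connect_latency_ms(peer_ip, 443))` gives a crude picture, where `peer_ip` is a placeholder for a VM you control in another zone. Very low numbers between zones do not prove shared facilities, but they are a reasonable prompt to ask the provider how the zones are separated.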

Angelique Medina:
Right. Yeah. You definitely want to be asking those questions because, to your point, I think there's a lot of assumptions that the way that the cloud providers work is somewhat uniform. Right? So when somebody says they have an availability zone in one cloud provider, maybe that's handled differently in a different cloud provider. And the same in terms of networking constructs, which is actually what we're going to talk about today. We have a great guest, Atif Khan from Alkira.

Angelique Medina:
And one of the things that he talks about is just the difference in the networking constructs between the providers. So always good to understand how your specific cloud provider operates.

Archana Kesavan:
Right. Totally. In talking about outages, I think Slack released a more in-depth root cause analysis of an outage that took place in May. I think May 12th, Slack had almost a 45-minute outage starting at 4:45 Pacific. So towards the end of the business day. But that kind of took down connectivity to Slack and was kind of a global blast radius of the outage. So you can read the detailed analysis but in short, it turns out that there was a bug in their load balancing infrastructure, which kind of derailed sync between their proxy and their web app tiers that affected the platform.

Angelique Medina:
Yeah. It's a great write-up and kind of details how effectively the mapping was not properly working at the time and whatnot. Even though the application itself was up and running, the assignment of instances was malfunctioning and because of that many users weren't able to use the service.

Archana Kesavan:
We actually go through this outage in-depth in episode eight. So if you're interested in that, feel free, check that out as well.

Angelique Medina:
Yeah. Great. I mean overall, like from an outage standpoint and just in looking at, for example, network outages last week, they were down overall. So we're starting to see numbers that are closer to what we saw in February, which is good. Although outages are kind of a fact of life if you're operating a network, and particularly a large network. And we did see last week that Comcast had a pretty notable outage that took place on July 4th, during peak usage hours in the evening. And this impacted multiple regions in the United States: Central, East and West. And that's what we're going to take a look at next.

Archana Kesavan:
The extent of the outage, Angelique, as you were saying it originated... Started out in the Midwest and then spread across to the East and the Western regions.

Angelique Medina:
Yeah, that's right. So it started around or just after 5:00 PM Pacific time. So this would have been I guess 3:00 PM, sorry, 7:00 PM Chicago time. So 7:00 PM Chicago time we see that their infrastructure is impacted in Chicago, so this is their backbone-

Archana Kesavan:
Was their backbone?

Angelique Medina:
Yeah that's right. Yeah. But what's interesting is that it's not contained to Chicago. A few minutes later we see that now the impact has kind of expanded. Okay. So the first 10 minutes or so is Chicago and then... Oh, here we go. Okay. So first five minutes, Chicago, and then we see New York, we see San Jose and we see Seattle as well, pop in. So this then kind of just persisted for some time. I mean, this was like a 30-minute outage. Yeah. And again we're talking about really important kind of peering points or transit points, New York, San Jose and so on. And then just kind of continued on.

Angelique Medina:
And this is regardless of which time zone you're in, this is probably something that you would have noticed assuming you weren't out like watching fireworks or something like that, in which case maybe you wouldn't have. But pretty notable outage, not only for its length but also just for the number of locations that were impacted. And an outage like this is typically caused by either a misconfiguration or something that's impacted the control plane. It's not like a router just died. Right? As it's multiple locations.

Angelique Medina:
So, I haven't seen anything on this from users or from Comcast. It would be interesting to see if anybody else noticed it. And if Comcast has anything to say about that. So that's kind of the highlight from this week in terms of an interesting outage that we caught. So that was our notable outage. Last week, we talked a little bit about resiliency given what happened with Google last week, in terms of multiple availability zones within a particular region going down. Also with the IBM cloud outage. It's been a topic that's come up a few times over the last several episodes. We had Jason Black from Uber come on a few episodes ago to talk about their resilient cloud architecture and the strategy that they have.

Angelique Medina:
So we thought we'd continue this conversation and invite Atif Khan, who is the CTO of a multi-cloud networking startup called Alkira. Atif was previously one of the founders of Viptela. So he has a lot of background in enterprise networking and works with a lot of enterprises to help them kind of implement a very flexible, kind of agile networking strategy with their cloud providers. So really excited to have him on, and that's up next.

Angelique Medina:
Thanks Atif for joining us.

Atif Khan:
Nice to be on the show. Thank you for having me, Angelique.

Angelique Medina:
So what about traditional enterprises, when they are contemplating moving to the cloud or maybe even some kind of hybrid or multi-cloud strategy, first of all, why would they do that? What are sort of some of the drivers for embracing multi-cloud for an enterprise?

Atif Khan:
That's a very good question, Angelique. So there are multiple reasons why enterprises are adopting multiple clouds. One is pretty simple and straightforward: each cloud is different in what they offer. So in certain cases, one cloud offers something which the other cloud doesn't offer, or an application in one cloud runs better, or one cloud is more suitable for a certain type of application than the other. So you choose the cloud based on that.

Atif Khan:
Secondly, cloud redundancy is one of the major factors, as well. The third reason, which we are seeing out there, especially with some of the large enterprises which we are working with: they started with one cloud but they ended up acquiring some companies, which happened to be in different clouds. And then they just instantly became multi-cloud companies. So there are different reasons. And redundancy, as you said, resiliency, redundancy, high availability, all of that is one of the key reasons.

Angelique Medina:
Yeah, absolutely. So, okay. So they're moving towards these types of deployments. Maybe they've inherited this because of an acquisition. Maybe they simply need to do that because they want to use some services in some of the cloud providers, or just for resiliency purposes.

Angelique Medina:
I know, for example, that a lot of the cloud providers have very different networking constructs. So Google is very different from, say, Azure or AWS, in terms of like how the regions work and how you would kind of connect across different regions. And even if you're just working within one cloud provider, you'd be working across different availability zones and across different regions. So what are some of the challenges that you see enterprises encounter when they start to contemplate moving to the cloud?

Atif Khan:
When you connect to multiple clouds, each cloud is very different when it comes to, as you said, the underlying networking construct. So enterprises have to learn, or they have to have staff which have expertise in each of these clouds, or they have to build expertise in each of these clouds. And just building that expertise takes time. And nowadays the cloud network is just an extension of your private network. So it's not different from your on-prem or your private network. It's extending into the cloud.

Atif Khan:
So you have to be able to run and manage your cloud networks similar to the way you're managing and running your private network. So you have to make sure that when you deploy your cloud network, it's deployed in a resilient fashion, it's highly available and you have complete visibility into what's going on into the cloud, across clouds, on-prem to the cloud, cloud to the Internet. So you have to have... Visibility is one of the key aspects of deploying or is one of the key requirements.

Atif Khan:
And then you have to be able to manage and control, from a routing perspective, from the routing policies perspective, and in general from the traffic policies perspective, your cloud the way you control your on-prem networks.

Angelique Medina:
Yeah. I mean, it's interesting. I was talking to somebody who had helped effectively roll out the cloud deployment for a large healthcare provider. And he had mentioned that it was really kind of an evolution, their rollout. They moved to the cloud, they started deploying and then they kept kind of going down architectural cul-de-sacs, where they realized like, "Oh no, we can't scale that way. Oh, no, like that's not going to work." And then having to make changes.

Atif Khan:
Cloud networking is not just about connectivity or providing connectivity into each of the clouds. Right? Sometimes that's the easier part of the whole thing. It's also about making sure that you secure the network, maybe in certain cases you move your security services into the cloud as well. Maybe there are requirements where you need to move your other services into the cloud as well. And then being able to deploy all of this based on the use cases because each use case has different requirements from what is required from the network perspective, as well.

Atif Khan:
So to give you an example, there might be use cases where traffic is going from on-prem to the cloud. There might be use cases where there is traffic between clouds or within even one cloud but between different regions as well. Or there might be a use case where there's traffic, which is coming in from the Internet and it's going into the cloud, your servers or your applications might be sitting inside the cloud. So you might need different types of services to front-end the cloud based on the use cases. Sometimes there are certain things which are very hard to accomplish.

Atif Khan:
And again, giving you a quick example. If you have a stateful service, let's say a firewall, which you're putting in front. If you have a use case where traffic needs to go through a firewall and then hit your destination or wherever your application is sitting, you might have a requirement where you need to auto-scale your stateful services. In certain cases you need to make sure that the traffic flows are symmetric and the services scale up and down based on the capacity requirements or the load through those services.
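The symmetry requirement can be illustrated with a small Python sketch. A stateful firewall needs to see both directions of a connection, so one common approach is to build a direction-independent key from the flow's endpoints and hash it to pick an instance. This is a generic illustration, not a description of how Alkira or any cloud provider does it; `pick_firewall` and the instance names are hypothetical.

```python
import hashlib


def pick_firewall(src_ip, src_port, dst_ip, dst_port, protocol, firewalls):
    """Map a flow to one firewall so that both directions pick the same instance."""
    if not firewalls:
        raise ValueError("need at least one firewall instance")
    # Sort the two endpoints so that A->B and B->A produce the same key.
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{protocol}|{first[0]}:{first[1]}|{second[0]}:{second[1]}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(firewalls)
    return firewalls[index]


# Hypothetical pool of firewall instances.
pool = ["fw-a", "fw-b", "fw-c"]
forward = pick_firewall("10.0.0.5", 51000, "172.16.1.9", 443, "tcp", pool)
reverse = pick_firewall("172.16.1.9", 443, "10.0.0.5", 51000, "tcp", pool)
print(forward == reverse)
```

Note that a plain modulo over the pool size remaps many existing flows whenever an instance is added or removed, which is exactly what happens during auto-scaling. Handling that usually means something extra, such as consistent hashing, connection draining, or sharing state between instances, which is part of the complexity being described here.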

Atif Khan:
So all of that adds to the complexity. So if you can do all of this in a simple manner across clouds, then that helps enterprises a lot. Basically, as an enterprise, you don't have to worry about the underlying limitations of the cloud providers’ infrastructure because as a customer of these clouds, every time you spin something up in the cloud from a networking perspective, you have some limitations which are imposed on you. Maybe those are number of routes or some bandwidth limitations.

Atif Khan:
Then you'll have to take care of all of your deployments, as far as, let's say, if you're going between different regions, like inter-region routing. And if you have multiple clouds, inter-cloud, all of that routing, segmentation. And nowadays these networks are global in nature.

Angelique Medina:
Yeah. And you can't manage what you can't see. So visibility is so critical, especially if you're in the cloud, you don't know what kind of performance you get unless you validate that you're getting what you need, for example... So it's just important that even though it's not your infrastructure, you're still using it, you're still responsible for it. So you should know what you're getting. I really appreciate you coming onto the show. This was a really fascinating, pretty new area. So glad you could share some of your thoughts on this with our audience.

Archana Kesavan:
Angelique, that was a really great interview. What did you take away from your discussion with Atif that you think would be worth sharing?

Angelique Medina:
Yeah. What was really interesting was how Atif talked about really being thoughtful about your specific use case. You may have different requirements depending on business unit, depending on where users are coming from. So maybe you have traffic that's coming into a cloud provider and you might need to think about security in a particular way, like a firewall or maybe a load balancer, but maybe not have those requirements for inter-app connectivity. Really thinking about your deployment beyond just connectivity and all of the different services that you might need and the performance requirements that you have because they're going to vary depending on what your use case is.

Angelique Medina:
So really trying to map your use case so that you're not over-architecting and you're being thoughtful in terms of the resources that you're using. So that was really interesting. It's always about the use case at the end of the day and context is kind of everything.

Archana Kesavan:
All right. So that's all we have for this week. Don't forget to subscribe, and you can get a free T-shirt by emailing InternetReport@thousandeyes.com with your address and size. Also, registration is open for the State of the Internet, and that's coming up on July 16th, which is next Thursday. So we have a lot of amazing speakers there. Fastly, David Belton from the Internet Society, Verizon, and then we have Angelique, too. Angelique is going to be actually talking about some really interesting data in terms of how the Internet has been performing over the last few months. So don't forget to sign up for that. And with that, we will see you next week.
