ISP Router Failure Takes Down Cloud Provider Services

The Internet Report, Episode 1 - Week of March 23-27, 2020

Over the past month, we’ve been flooded with questions about how the Internet is holding up given the extra strain it's been under with the sudden influx of remote workers, remote schoolers, and overall increased use due to COVID-19 related self-isolating and shelter-in-place orders. We’ve put out blogs and have conducted executive, media and analyst briefings. Network World and the IDG family of publications have even started publishing our data on a weekly basis to keep their readers up to date, as things are changing so frequently.

Because of the continued interest in how the Internet is handling the current and, potentially, increasing traffic loads, we decided that now is the right time to kick off a show to answer this question each week. How is the Internet faring? What were some of the most interesting events we observed during the week? I’m pleased to share the inaugural episode of The Internet Report.

Watch along in the video above, or read the transcript below. Don’t forget to subscribe to our blog and our YouTube Channel to be the first to get these episodes moving forward. And feel free to leave a comment here, on YouTube, or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport. We hope you find this info useful, and we look forward to your feedback.

Show Links:

Review the interactive share link of the March 27 Google Outage here and here.
Vodafone reports a 50% rise in Internet use as more people work from home
Verizon sees almost 20% increase in web traffic in one week due to COVID-19
A large-scale Cogent Communications outage impacted the Northwest United States — see an interactive view of the outage here.
Another Cogent outage impacted the reachability of Verily’s projectbaseline.com for users in Northern California — see an interactive view of the outage here.

Angelique Medina:
Hello everyone and welcome to The Internet Report. I'm Angelique Medina and I'm joined today by Archana Kesavan. We're going to be walking you through the state of the Internet and looking at the previous week and all of the interesting events and outages that have taken place. We're going to be doing this show on a weekly basis, but for our first episode today, we're going to be looking at performance, starting back in the middle of February, all the way through to the end of last week, which was March 29th.

We're going to primarily be looking at outages today and looking at the overall trends for the last six weeks, because there have been a lot of changes in terms of how traffic is traversing the Internet. We're going to see if there have been any noticeable impacts in terms of performance. Just to give you a sense of what our data set looks like, we have here at ThousandEyes thousands of sensors that are distributed around the globe. We have pretty broad coverage, particularly in North America, but also in Europe and Asia-Pac. All of these sensors are effectively measuring Internet and application performance. That generates billions of telemetry data points, and that data is being used to detect outage events. We're going to give you a little bit of a roll-up of these events across different providers.

Just to give you a sense of what we're looking at from a macro level, these are outages that are taking place across ISPs, public cloud providers, collaboration application providers (which we'll also refer to as UCaaS providers), and edge services. Edge services include CDN, DNS, and Security as a Service (SECaaS) providers.

Across all of these providers, we've seen a pretty significant increase over the last six weeks. Just from the start of where we're looking at, this is February 17th, towards the end of March, we're seeing a 42% increase in the number of outages that we detected. The last couple of weeks have both represented peaks in the number of outages that we've seen, so topping 300 plus week-over-week the last couple of weeks. We've also seen some records, not only for ISPs, but also in some of the UCaaS providers as well. We're going to touch on that.

A lot of folks have asked, where are we seeing some of the major issues in terms of providers? We're going to look at cloud providers first. The interesting thing about the cloud providers, and again, a lot of folks have wondered how are they holding up given the tremendous amount of traffic that they are likely under, especially since there's a lot of remote workers and many enterprises are using cloud providers to host their VPN concentrators, their VPN gateways. They're getting a lot more inbound traffic, simply looking at it from the standpoint of enterprises and remote workers, and likely are also seeing increases from a consumer standpoint.

Overall, we're not seeing any significant increase in the number of outages. These numbers are pretty normal for what we would see with cloud providers and nowhere near the peak number of outages that we've seen from them. That is also the case in the United States as well. Even lower numbers, not all out outages. There's a little bit more on the week of March 16th. We saw a little bit of a rise, but that's again well within the typical number that we would see for cloud providers. Given that, is that something that's surprising to you, Archana? You've done a lot of work with cloud providers and performance. Is that something that you would expect?

Archana Kesavan:
Yeah, that's totally in line with our expectation, because these providers, if you've noticed over the last couple of years, they've been making some significant investments in their backbone and undersea cables and so on. In terms of infrastructure and bandwidth, they definitely have the capability. Any increase in traffic that they are seeing is probably well-handled. However, if we do see outages of some sort in the future, it may be because of a fat-fingering issue, something similar to what we saw with AWS three years ago. Traffic overload does not necessarily have to create outages within these cloud providers. Yeah, I think the data is in line with what our expectations are at this point.

Angelique Medina:
Yeah. They know how to manage and keep a massive network running. When we have seen outage events, they've typically been due to configuration issues or just infrastructure failures, which are not something that are related to traffic surges. Those are events that you can't really plan for. Overall, they know that they're handling a lot of traffic today and they seem to be holding up well, which is interesting. We'll cover this again, because one of their, at least speaking of Google, one of their senior VPs made a comment about COVID-19 specifically and how they're holding up under the increased traffic surge.

Archana Kesavan:
In terms of the infrastructure, it has to kind of override all the resiliency and backups that they have in place already. Yeah. That's something we should just be aware of when it comes to these cloud providers. They're actually really well-prepared to handle the surge in traffic and also any failures that they might have.

Angelique Medina:
Yeah. It's interesting because you would also think ... If we move on to the ISPs, they also are fairly well-provisioned to quickly scale up in terms of their ability to handle traffic loads. Vodafone mentioned they saw a 50% increase in traffic. I think it was Verizon that said something like a 20% increase, and they're doing fine. Now, we have seen an increase in the number of outages across the ISPs. Now it's a much larger bucket, of course, than the cloud providers. It's a lot of different types of networks and ISPs. This includes mobile providers, broadband providers, as well as transit providers, but there has been a notable uptick. You see this “step up.”

This was a new peak in the first week of March. All of a sudden there's a pretty dramatic increase in the number of outages. Then subsequent to that, we saw, again, a new record made the week before last. Then last week again, a new record. All of this represents more than 200 outages in a given week, which is not something we've seen in a long while. We've only seen that at one other point. This is pretty noticeable, just overall. It's not coming down. It seems to be on the rise. Why is that? It's not clear at this point. It could be that the ISPs are making configuration changes or they're changing their peering relationships. There are other things that are going on that might be impacting that, because as mentioned, the ISPs themselves have said that they're fairly well-positioned from a capacity standpoint.

But we're also seeing this in the US. Even just looking at, for example, the minimum outage week that we saw, the week of February 24th, there were 59 outages. If you contrast that with last week where we saw 120, I mean, that's a little over a 100% increase in the number of outages. Again, there was a pretty dramatic step up going into March, and then that hasn't really gone down all that much since then, so it's something that we'll keep an eye on. We wouldn't expect that the ISPs would just suddenly be overwhelmed, but there could be other reasons why we're seeing this number of outages.
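As a quick check of the week-over-week comparison above, the percentage change can be worked out directly. This is only a minimal sketch: the helper name `percent_increase` is made up for illustration, and the two counts are the US ISP outage figures quoted in the episode.

```python
def percent_increase(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    if before == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# US ISP outage counts quoted in the episode.
feb_24_week = 59   # lowest week in the period discussed
last_week = 120    # most recent week discussed

print(f"{percent_increase(feb_24_week, last_week):.1f}% increase")
# prints "103.4% increase"
```

In other words, going from 59 to 120 outages is slightly more than a doubling.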

This is interesting too, because the UCaaS providers, so these are the collaboration application providers, they are under a tremendous amount of strain. I mean, they've said that they've seen unprecedented levels of traffic. They have new users. If you think about all the remote workers and distance learning, I mean, this is a pretty sizable increase in traffic that they're experiencing. We very rarely see outage events within the UCaaS providers, and it held pretty steady until the week before last. This was the week of March 16th when we saw a really significant spike in outage events. This was the case not only globally but also within the US. This was a 467% increase from the previous week.

Now, it has started to go down and hopefully we'll see that trend continue, where it just represented an unusual week for those providers and now they're starting to hopefully adjust their infrastructure and their capacity to handle more traffic. It has gone down and hopefully, we'll see that continue this week. Now, is that surprising to you at all, Archana, that we would see this kind of a trend where it just dramatically spiked and then went down?

Archana Kesavan:
I think they're making changes to their infrastructure, probably in the back end, to accommodate this large influx. Like you said, it's not just remote workers. There's the distance learning and everybody ... Kids are on these platforms right now. I think there are definitely some back end configuration changes that might have unfortunately caused an outage. I'm hoping over the next few weeks it kind of tapers down.

One of the examples we did see the week of March 16th, is a UCaaS provider suffered some DNS-specific issues. Not necessarily that their DNS service went down, but we anticipate, or we are speculating that it could possibly be a configuration change to their DNS records to handle the surge, to handle the influx. Not surprising, but hoping that that actually tapers down because it's so critical right now in terms of being connected and being productive.

Angelique Medina:
Yeah, absolutely. The DNS issue, that was just in addition to some of these network issues that we were seeing. The DNS service itself, as you mentioned, was totally fine. This was apparently a configuration issue. Again, they're probably making a lot of changes on the backend, but it looks like it's headed in the right direction. They are able to scale out pretty rapidly. We're hoping that that downward trend continues.

Angelique Medina:
Now, it's interesting, last week we also saw another instance where there appeared to be some kind of outage event that was not related to traffic surges within a cloud provider. We know this because one of the cloud providers... Well, in this case it was Google, and one of their executives who's responsible for the network tweeted out that there was an issue where they had a router failure in Atlanta. This issue was completely unrelated to any surges in traffic that were COVID-19 related. He made a point to specifically point that out, because I think a lot of people are concerned about whether the major providers are able to keep up with traffic.

Angelique Medina:
Now, to talk about this particular outage, because we had a lot of back and forth on this, we're joined by Deepak. Deepak is joining us from Dublin. Thank you for making time later in the day.

Deepak Ravisankaran:
Absolutely, guys. Really happy to be on. This was quite an interesting event because a network failure in Atlanta had a massive impact. If you look at this data set, this is specifically from our ThousandEyes employee end-user experience data set, and we have a number of users on the left side, in different regions of the US, trying to access a number of Google services like Docs, Calendar, Drive. What's very noticeable is a couple of locations on the East Coast, Atlanta and North Carolina in this case, clearly having huge network loss failures. Immediately I'm able to pinpoint it to a node in the Google network that had issues. This was an immediate way for me to know that, yeah, I can see what the tweet said and I can see the failure in the network layer.

If I move up towards a slightly different view, which kind of paints a different picture, but this is the end-user experience score data set and you can see how different products within the Google suite are all showing red, all having dips in experience score, and telling us that yes, indeed there was a large group of users affected by that simple failure.

But then Angelique, you were looking at something interesting on a different data set where the errors were not very network specific. Do you want to talk a little bit about that?

Angelique Medina:
Yeah, that was really interesting because one of the things that we saw was that yes, there were users who were complaining about issues on the East coast, but we also saw on social media that there were other folks who were saying they were able to reach Google's site but they were basically getting errors. What we saw there was pretty interesting because, over that period where the incident took place, we saw that from locations around the US, there was intermittent availability. We were seeing, for example here, that some users were getting errors, some were getting receive errors, but mostly were seeing 500 server errors.

This is interesting because basically what it means is that the network was totally fine. You were able to get to the site or the front door. We didn't see any packet loss or latency reaching Google's edge, but then you get to the site and you're seeing that you're basically getting these errors. Why would that be related to this router issue in Atlanta? This is something that we discussed quite a lot over Slack. Why don't you tell us kind of what your take on this is?

Deepak Ravisankaran:
Oh yeah. On the data set that I was looking at, users were accessing a specific node or specific server in Google, and it happened to traverse networks that had failures in the Atlanta region and we saw it as a network failure, but in the data set you were looking at, we were seeing the network path to the front end servers being completely clean, but then we see a 500 error, which essentially tells us when the front end server tried to perform a redirect or retrieve data from the back end server, there was a problem. Immediately, the server throws out a 500. What we can potentially say is that when the front end servers were trying to redirect to the back end server, there might've been a network failure in that path leading to a 500 error, which makes sense. It's also interesting that a network error would present itself as a 500 to the users. We initially thought that this was an application error. We initially thought that Google was having a back end problem until we did further investigation and figured this out.
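The behavior described above, where a network failure between front end and back end reaches the user as a 500, can be sketched in a few lines. This is an illustration only, not how Google's front ends are implemented: `handle_request`, `healthy_backend`, and `unreachable_backend` are hypothetical names, and the back end call is simulated rather than made over a real network.

```python
from typing import Callable, Tuple


def handle_request(fetch_from_backend: Callable[[], str]) -> Tuple[int, str]:
    """Hypothetical front-end handler: returns (HTTP status, body).

    The client has already reached the front end over a clean path.
    If the internal hop to the back end fails, the client only sees
    a server error, not the underlying network failure.
    """
    try:
        body = fetch_from_backend()
    except (ConnectionError, TimeoutError) as exc:
        # A network-level failure on the internal path is surfaced as a 500.
        return 500, f"Internal Server Error: {exc.__class__.__name__}"
    return 200, body


def healthy_backend() -> str:
    # Simulated back end that is reachable.
    return "document contents"


def unreachable_backend() -> str:
    # Simulated back end behind a failed internal network path.
    raise ConnectionError("no route to back end")


print(handle_request(healthy_backend))      # (200, 'document contents')
print(handle_request(unreachable_backend))  # (500, 'Internal Server Error: ConnectionError')
```

From the outside, the second case is indistinguishable from an application bug, which is why the errors first looked like a back end problem. Some gateways report this kind of failure as a 502 or 504 instead; 500 is used here because that is what was observed in the episode.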

Angelique Medina:
Absolutely. I think that really speaks to the distributed nature of applications. If you think about Google and how they deliver their service, yes, you're reaching their front end, but there are all of these dependencies on the back end, different parts of the application, databases, other pieces of the service that have to be reached over a network, even within Google's own network and infrastructure and how they are building out their application. Even though this network issue took place in Atlanta, it was impacting users on the West Coast, in the Southwest, all across the US. That's very interesting. It's unclear if there are differences in changing network patterns potentially due to increased traffic loads, which is why traffic was going through that region for a variety of users, or maybe that's normal. We don't know, but either way, I think it's interesting that it points out the network dependencies even on the back end, not just in reaching a service, but also in completing a service.

Angelique Medina:
Yeah, it's very interesting. We also had a couple of other notable outages. In particular, there was a really significant Cogent outage. We don't have time to get to that today, but we'll likely post some additional details on that on our site and we'll have some share links in the show notes.

Archana Kesavan:
All right. That wraps up this week's show. Angelique, thank you so much. And Deepak, I know it's late for you in Dublin, so thanks for jumping in and giving us your insights as well. And for everybody out there, I hope you guys found this interesting. Tell us what you feel. Leave us a comment or a tweet at our Twitter handle.

And if there's anything specific you want us to cover in the next episode, leave us a comment as well. Feel free to also follow us on our blog, thousandeyes.com. Every time these large-scale outages happen, we kind of do a deep dive and lay out what we saw from our perspective. And we'll be covering the state of the Internet in our blogs as well. So definitely feel free to follow us there. All right, thanks for watching and we'll see you guys next week.

The Internet Report

Ep 1: ISP Outages On The Rise, Router Failure Takes Down Cloud Provider Services During COVID-19
