Never a dull minute on the Internet! In today’s episode, Archana and I dove into a YouTube service disruption and an (unrelated!) Google network issue in India. We also discussed Slack’s explanation of their service disruption last week, and even talked through a case out of France where an education site experienced performance issues in lockstep with time-of-day usage.
Give this week’s episode a watch or a listen and then come on back next week on Tuesday, May 26th, because a) Monday is Memorial Day for us in the U.S., and b) because we know outages just love to happen on holidays. ;) But seriously, we’re looking forward to next week, when we’ll be joined by TeleGeography’s Alan Mauldin to discuss submarine cables, terrestrial networks, international Internet infrastructure and more.
Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.
Show Links:
May 14th YouTube service disruption interactive snapshot
May 14th Google network interactive snapshot
May 12th Slack Status update
May 12th Slack service disruption interactive snapshot
Catch up on past episodes of The Internet Report here.
Follow Along with the Transcript
Angelique Medina:
Welcome to the Internet Report. My name is Angelique Medina, and I'm joined by my co-host, Archana Kesavan.
Archana Kesavan:
Hey guys.
Angelique Medina:
And what we're going to cover is all of the events worth noting that happened on the Internet last week. And last week was actually a very busy week for us, because we saw a service outage at YouTube, there was an issue with Slack on Tuesday of last week, we also saw a pretty big Google network outage, and what else? There were also some regional sites that had some issues. So we're going to cover all of that on the show today. Lots to unpack: a lot of application-related stuff, a little bit of network stuff, and then we'll do some stats.
Angelique Medina:
So with that, we'll go ahead and get started. As I mentioned last week, and I'll go ahead and mention again here, for those of you who do not yet subscribe to the Internet Report, we cover a lot of really interesting stuff. A lot of it is network or Internet related, but a lot of it is also related to the availability and performance of applications and services, like we're going to spend a lot of time on today. So go ahead and check us out. You can subscribe to us on YouTube, and anywhere that you get your podcasts.
Angelique Medina:
So with that, we're going to start with the YouTube service outage, because that is the outage that really got the most attention for users because it was also a global issue. So the way that it unfolded was, we see here this happened on May 14th, which is last Thursday, and it happened around 4:00 or so.
Archana Kesavan:
Around 4:00, 4:30, 4:45 PST.
Angelique Medina:
That's right.
Archana Kesavan:
Yeah.
Angelique Medina:
So it was basically a situation in which users all over the globe, not just in the US or on the West Coast, were simply not able to load content on the YouTube site. And YouTube themselves later issued a statement: they put a tweet out saying that if you had been experiencing error messages, they had fixed the issues, and they said that it really only lasted about 20 minutes.
Angelique Medina:
So what we saw during the outage is pretty interesting. It really does corroborate what YouTube was saying in terms of the duration of the outage. What we can say right up front is that it wasn't a network issue: all of the paths from users to Google's front door, their edge servers, looked pretty clean. And we keep mentioning Google here because, of course, Google owns YouTube; YouTube is one of Google's services. So we could see this for a variety of locations around the globe, and I'm just going to go to this map here so you can see what we're looking at from a coverage standpoint. In this particular instance, we're testing to youtube.com, and this represents a pretty nice distribution of users around the globe: US, Europe, Australia, Asia. And just at first glance, everything looks green here. We can even see that during the time of the incident, so again, this is around 4:20 PM on the 14th, the HTTP availability all looks good. So basically, Google's front door is reachable.
Archana Kesavan:
So this was interesting because people were not complaining that they couldn't get to YouTube; they were complaining that they couldn't load the videos on YouTube. So reaching the front door was not the problem, and the availability of the server, as you're seeing here, was not the problem either. The detail lies in what's happening when the page is actually loading.
Angelique Medina:
So what's interesting here is that this particular time series data up here is basically tracking how long it takes, on average, to load youtube.com. And during the period of time when people were complaining, the page load time actually went down. At first glance, you might think, "Oh, that's actually a good thing, right? You should be able to load the page faster." But in fact, what this indicates is that certain components that are really critical to the service were simply not able to load, and that's why the overall time to load the page dropped.
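To make that counterintuitive drop concrete, here is a minimal sketch with entirely made-up numbers: if a critical object fails fast with a 500 instead of returning its content, the dependent fetches never happen, and the measured page load time goes down even though the page is broken. (Real page loads are parallel and more complex; this only illustrates the direction of the effect.)

```python
# Hypothetical component timings (ms) for a simplified, sequential page load.
healthy = {
    "index.html": 120,
    "browse-ajax XHR": 450,   # returns the video listings
    "player.js": 300,
    "thumbnails": 600,        # only fetched once the XHR succeeds
}

during_outage = {
    "index.html": 120,
    "browse-ajax XHR": 80,    # fails fast with an HTTP 500
    "player.js": 300,
    # thumbnails are never requested: the dependent fetches never fire
}

print("healthy page load:", sum(healthy.values()), "ms")        # 1470 ms
print("outage page load: ", sum(during_outage.values()), "ms")  # 500 ms, but broken
```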
Angelique Medina:
So if we just look at an example here, let's click on Ashburn. So again, we see they're experiencing the same thing; there's all kinds of stuff going on here. What's pretty notable is that during the course of the outage, we are seeing what are basically 500 errors.
Archana Kesavan:
From one particular component, like the ajax file right there.
Angelique Medina:
Right, yeah. So this browse_ajax object is throwing a 500 error. What's interesting about this is that if you go back in time to when the service was available, we don't see this error, and the same thing afterward as well: it goes away, but during the outage, it's there. The other thing that's interesting: ajax is basically a mechanism that keeps a request flow in place as you're interacting with a webpage. So looking at what the page overall looks like before, during, and after, we can see, for example, that before, when we weren't seeing this particular error, they were loading some of these domains, these video-related domains, and everything looked good. During the course of the outage, they're not available. And when the outage is addressed a little bit later, they're back-
Archana Kesavan:
They're back.
Angelique Medina:
Right?
Archana Kesavan:
Mm-hmm (affirmative).
Angelique Medina:
During those 20 minutes, we don't see them and then they come back. And that's really been the case across the board with other locations as well. So we can see like Bucharest, for example, similar behavior.
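A minimal sketch of the distinction being drawn here, that the front door can answer with a 200 while a critical component 500s, might look like the following. The browse_ajax path is modeled on the object discussed above, not on any documented YouTube API, so treat it as purely illustrative.

```python
import requests

# The main document can succeed while a critical XHR component fails.
page = requests.get("https://www.youtube.com/", timeout=5)
component = requests.get("https://www.youtube.com/browse_ajax", timeout=5)  # illustrative path

print("page:     ", page.status_code)       # front door fine (200)
print("component:", component.status_code)  # during the incident: 500
```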
Archana Kesavan:
So one thing that was interesting here, especially with that stray ajax error showing up, was the time it took to resolve. In some locations, we noticed it took about 10 minutes for the issue to get resolved, but in other locations, it took about 20 minutes. Russia was a good example: we saw that it took about 20 minutes there, and it was interesting just given the time when this happened. This happened around 4:30 PM on the West Coast of the US, which is pretty late at night in Russia.
Angelique Medina:
About 2:30 in the morning, yeah.
Archana Kesavan:
Right. So if there was a fix being pushed out, which obviously there was, it rolled out to the more critical areas first, which is intelligent and, I think, a smart way to handle any changes.
Angelique Medina:
Yeah. That's interesting, because we saw that it took about the 20 minutes Google indicated for Moscow, Kazan, and Turkey, which are not going to be at the top of the priority list, since it was the middle of the night there and people were probably not heavily using the service. But the service came back faster not only in the US but even in the UK, where it was still around midnight. So that's interesting in terms of how the service outage was resolved. So we had a particular object that was consistently throwing a 500 error throughout the service outage, a lot of other components of the page simply were not available during that period across the board, and then it resolved within roughly 10 minutes for users in the US and Western Europe, and a little bit longer, 20 or so minutes, for some users, in particular in Russia and Turkey.
Angelique Medina:
So there was a bit of a brouhaha on social media about not being able to use YouTube, which is an essential service at 4:00 PM-ish in the afternoon on the West Coast.
Archana Kesavan:
On a weekday.
Angelique Medina:
On a weekday, of course, yeah.
Archana Kesavan:
For 20 minutes.
Angelique Medina:
That's right. Pretty tragic. So that's what we saw as far as the YouTube service issue, and as YouTube said, they acknowledged it and had it resolved within about 20 minutes. Now, what was interesting was that on the same day, we also saw a sizable network outage. Again, this was not related to a specific Google service or application; it was infrastructure within Google's network in the India region that was having issues. So we can see here, this was the same day, Thursday, May 14th, at 22:45 UTC. Somebody has to do the math on what time that was on the West Coast, where I am. And of course, Archana, you're in New York on the East Coast.
Angelique Medina:
So we saw here that for users in India, there was a pretty significant number of nodes that were impacting the reachability of Google services. So users in that region connecting to a variety of Google services may have experienced a disruption in reaching some of those services.
Archana Kesavan:
So that was around the same time as the YouTube outage, but they were mutually independent, one not impacting the other. It was around 4:45 PM, around the same time.
Angelique Medina:
This is interesting because when we started to look at some of the complaints that people had about YouTube, some of them even mentioned Google and having issues there. And in looking at this, you could definitely go down the path of asking whether it was network related, since this happened around the same time, but it was completely independent of the application issue. Just because two things are correlated from a time standpoint doesn't necessarily mean that one caused the other. Sometimes many different things go wrong at once; Google apparently just had a spot of bad luck on May 14th. So moving on from that, the other major service outage that came up, and this was actually earlier in the week than the YouTube issue, was Slack. And we definitely noticed it, because suddenly I started getting a whole flurry of emails and text messages wondering what was going on, and it turned out that Slack was down.
Archana Kesavan:
We were talking about that: the YouTube and the Slack outages happened around the same time of day, in the 4:30 to 4:45 PM PST time frame, two days apart.
Angelique Medina:
Yes, you're right. They did happen around the same time.
Archana Kesavan:
Yeah.
Angelique Medina:
So what Slack said was that they had an issue with registering a set of servers, the servers that they needed to provision. Presumably they provisioned the servers they needed to meet user demand, but they apparently did not successfully register those servers with their load balancer, and because of that, traffic wasn't distributed across those servers in an optimal way. That degraded the health of the available servers, which then led to users not being able to access the service in a meaningful way. So a series of unfortunate events, where one thing leads to another. This was their very brief statement on it. Hopefully, when they complete their post-mortem analysis, there'll be additional information, but it does align with what we saw during the course of the outage.
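As a rough sketch of the failure mode Slack described, not their actual architecture, here is how a registration step silently failing for new servers concentrates traffic on the old pool. All names and numbers are illustrative.

```python
from itertools import cycle

provisioned = [f"server-{i}" for i in range(10)]  # capacity ops intended to add
registered = provisioned[:2]                      # registration silently failed for the rest

# The load balancer only rotates through targets it knows about.
pool = cycle(registered)
hits = {server: 0 for server in provisioned}

for _ in range(1000):  # 1,000 incoming requests
    hits[next(pool)] += 1

for server, n in hits.items():
    print(f"{server}: {n} requests" + ("  <- overloaded" if n > 200 else ""))

# The two registered servers absorb 500 requests each while eight healthy,
# provisioned servers sit idle; the hot servers degrade, and users see
# 503s and timeouts.
```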
Angelique Medina:
So this was in the midst of the outage. We see a lot of red here, and what this is showing is that we're effectively getting HTTP errors across most locations. In some locations, we're simply not getting a response within five seconds, and we're effectively timing out. So clearly this is global. This is not network related, at least in reaching the front door, and it does clearly seem to be something application related.
Archana Kesavan:
Comparing this to YouTube: the YouTube outage was not network related either, but there HTTP was successful, and the issue was in how the page was loading, with one particular component failing. This is very interesting because this is, again, not a network issue, as you can see here; reaching the front door of Slack was fine. But it does translate into an HTTP issue: the front-end servers were either not responding at all and timing out, or returning an HTTP error because the service itself was unavailable.
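Here is a minimal sketch of the kind of front-door check being described: one GET with a five-second timeout, classifying the result as a timeout, a connection failure, or an HTTP error. This is illustrative, not the monitoring agent's actual code.

```python
import requests

def probe(url: str) -> str:
    """Classify a front-door check the way it's described above."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.exceptions.Timeout:
        return "timeout: no response within 5 seconds"
    except requests.exceptions.ConnectionError as exc:
        return f"connection error: {exc}"
    if resp.status_code >= 500:
        return f"HTTP error: {resp.status_code} {resp.reason}"
    return f"ok: HTTP {resp.status_code}"

print(probe("https://slack.com/"))
```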
Angelique Medina:
Right. So we were getting a 503 error, which we'll take a look at in a moment, but basically, it just means the service isn't available. And so no part of the page itself was actually loading. Because Slack talked a little bit about their server infrastructure and the fact that they were using a load balancer, we wanted to take a quick look at their delivery architecture, reaching their front door.
Angelique Medina:
So we can see here that there's a number of locations on the left-hand side connecting to Slack's front-end web servers. Slack, in this case, appears to be hosted in AWS, and you can see the specific compute instances they're using in various locations. You can also see that the IP addresses are fairly distinct: most locations are connecting to a specific IP address, and there are many different IP addresses. It's not an anycast service by any means. So they're clearly using some mechanism, probably DNS, to distribute traffic across their front door. The other thing that's interesting is that not all of their locations are self-hosted.
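A quick way to see the DNS-based (rather than anycast) distribution being described is to resolve the hostname and look at the addresses handed back; run from different regions or against different resolvers, and the answers will typically differ. A minimal sketch:

```python
import socket

def resolve(hostname: str) -> set:
    """Return the set of IP addresses the local resolver gives for a host."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return {sockaddr[0] for *_, sockaddr in infos}

# With DNS-based load balancing, different vantage points see different,
# unicast front-end IPs; an anycast service would advertise the same IP
# everywhere and let routing pick the nearest PoP.
print(resolve("slack.com"))
```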
Archana Kesavan:
That's right, yeah.
Angelique Medina:
So we saw at least two, the Cape Town location where we see we're actually hitting CloudFront’s PoP, and then the other one was-
Archana Kesavan:
I believe Cairo.
Angelique Medina:
Yeah. So Cairo is interesting because you're connecting here to ... yeah. Where did Cairo go?
Archana Kesavan:
All the way down. Here you go.
Angelique Medina:
Cairo is also connecting to CloudFront.
Archana Kesavan:
CloudFront, yeah.
Angelique Medina:
At a different PoP. So the Cape Town location was connecting to a CloudFront PoP in Cape Town, and then Cairo was connecting to a CloudFront PoP in Frankfurt, which is interesting because we saw during the course of this incident that we were getting different response headers as part of this 503 error. So-
Archana Kesavan:
In some cases, we noticed that the web server, the front-door server, like Angelique was saying, was just not responding: it was timing out after five seconds. So in one case, we were not getting any response from the server at all. And in the other case, we were getting a 503 Service Unavailable response. But from Cairo and Cape Town specifically, we were seeing response headers that indicated they were front-ended by CloudFront, and there was an X-Cache header from CloudFront right there, if you notice.
Archana Kesavan:
Which is interesting.
Angelique Medina:
Yeah. If we look just at Cairo, for example, we know that they're front-ended by CloudFront, and we can see that during normal conditions, when you're getting a correct OK response, you're getting this "Miss from cloudfront" header, which probably indicates that CloudFront is doing what it's supposed to be doing: it's not caching anything, it's basically just the front door, retrieving what it needs to retrieve from Slack. So under normal conditions, CloudFront is basically just the go-between for a user and the origin.
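To see the header being discussed, you can print a response's X-Cache value directly; a minimal sketch, using slack.com purely as the example from this episode:

```python
import requests

resp = requests.get("https://slack.com/", timeout=5)
print("status: ", resp.status_code)
print("server: ", resp.headers.get("Server", "n/a"))
print("x-cache:", resp.headers.get("X-Cache", "n/a"))

# On a CloudFront-fronted, uncached page, "Miss from cloudfront" means the
# PoP passed the request through to the origin. Seeing CloudFront headers on
# the 503s during the outage is consistent with the failure sitting behind
# the CDN edge, at the origin/back end, rather than at the edge itself.
```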
Angelique Medina:
Now, if the other front-end servers that Slack is hosting are behaving in a similar way where they have to go back to the origin or some back-end application server to service requests, that could be where the issue is and that does align with this 503 error. So it's possible that the server infrastructure availability or resourcing issue was on the back-end and that's why it was so consistent across global users because it was a fundamental availability issue.
Angelique Medina:
So that's just a very quick look at what we saw there. Again, in keeping with what Slack had to say on the issue, and maybe when they provide more details in their post-mortem, we can dig a little bit more into how that corresponds with what we're able to see.
Archana Kesavan:
Right. Right. At this point, we can jump into our weekly analysis of how outages are trending. Let's get that right here. All right. So this is again the week-by-week view, where we get into what the network outages are for the week of May 11th. Actually, here, I'm going to pass.
Angelique Medina:
Throwback. Throwback. Yeah.
Archana Kesavan:
May 11th right?
Angelique Medina:
Yeah. We're just correcting and optimizing in real-time.
Archana Kesavan:
All right. Okay. We are going to go back again. All right. So week by week, we analyze network outages in terms of how ISPs are doing, how cloud providers are doing, and so on. So for the week of May 11th, as you can see here, we see a little bit of an uptick in global outages, which include all different network events, whether in ISP networks or in cloud provider networks. A little bit of an increase there, but nothing to be specifically alarmed about. ISP outages show a similar trend to the global outages, but cloud outages, as you see here, especially in the US, have definitely gone down, from three outages to one this week. So it's always interesting to see how this trend line is going. We're seeing a little bit of stability week by week, and we'll get to this again next week.
Archana Kesavan:
But I think what's interesting here is that in our last episode, we had Arash, a senior researcher here who works on outages and how we detect them on the back end, and he was talking about how the network outages and the numbers that we talk about here really refer to cases where there is 100% packet loss in the underlying network infrastructure. But outages do not have to manifest that way, like we saw in the case of YouTube, where you couldn't necessarily load particular videos. And in the case of Slack, there was no issue, from a network perspective, getting to the front door of Slack, but there was a back-end issue that was probably causing the problem.
Archana Kesavan:
What we saw, interestingly, is here, let me get to this particular view. In this particular instance, there's a French online portal, a learning portal that kids are accessing, and over the last week, we noticed an increase in the page load times of this particular website. As you can see, it maps to the time of day. In the morning, when usage is high, we start to see some slowness, but then in the evening, things taper off. I think what was really interesting is that when we dug into the details, we noticed some display images, which were not really critical to the website, were taking a really long time to load. For instance, if you ... Let me click on this image and this is actually ...
Archana Kesavan:
Yeah, and this doesn't happen during off-peak hours; we don't see those particular images taking that long to load. This was really interesting because the images that were slow during the morning peak were actually being loaded from OBS. So it was an external third party that was actually interrupting the performance, which is interesting. I think it started on the 18th, and we're starting to see this disruption again. So just something to keep in mind: outages don't necessarily have to mean packet loss, 100% packet loss. It's also degradation in terms of how the user experiences a particular website.
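In the spirit of the resource-level digging Archana describes, here is a minimal sketch of timing individual page resources to spot a slow third party. The URLs are placeholders, not the actual portal or image host; a real analysis would parse the resource list out of the page itself (or use a full browser waterfall, as the platform does).

```python
import time
import requests

# Placeholder URLs; a real check would extract these from the page.
resources = [
    "https://example.com/",                 # first-party document
    "https://thirdparty.example/img1.jpg",  # third-party display images
    "https://thirdparty.example/img2.jpg",
]

for url in resources:
    start = time.perf_counter()
    try:
        requests.get(url, timeout=30)
        print(f"{url}: {(time.perf_counter() - start) * 1000:.0f} ms")
    except requests.exceptions.RequestException as exc:
        print(f"{url}: failed ({exc})")

# Repeating this on a schedule (e.g., hourly) would expose the time-of-day
# pattern discussed above: the same third-party images loading quickly
# off-peak and slowly at peak.
```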
Archana Kesavan:
And now, given that everything's online and everybody is accessing these things online, there is real pressure that usage and demand are creating on these services. So with that, let's go back here. Angelique, do you want to introduce the State of the Internet?
Angelique Medina:
Yeah, absolutely.
Angelique Medina:
Just on the outage numbers, if you want to drill down into those a little bit more, you can go to thousandeyes.com/outages and get all of the numbers there. So, as we brought up last week, we are going to be putting on a virtual summit called the State of the Internet. It's taking place on June 18th, and we're putting together the agenda for the event. We're going to be unveiling some new research, and we're going to be moderating panels with folks talking about internet performance during the recent period, BGP route security, as well as the future of the internet. We're going to have folks who are network operators for internet infrastructure, as well as enterprises, coming in and talking about their recent experience and what they expect going forward. So it will be really exciting.
Archana Kesavan:
Right. If you've been watching our series, you know David Belson from the Internet Society; he's actually going to be a part of this event, and he's going to talk about internet resiliency in particular. We also have EdgeCast on the agenda as well. So the agenda is building, and we'll keep you updated on how it's turning out.
Angelique Medina:
Yep.
Archana Kesavan:
All right. With that, we are almost at the end of our show. And again, as Angelique mentioned at the beginning, if you're still not following us, it's about time you do so. We're available on any of your favorite podcast platforms. And if you follow us, you can get this free shirt: email us at internetreport@thousandeyes.com and we'll send you the “working safely from home” T-shirt.
Archana Kesavan:
And before we wrap up: next week is going to be interesting, but we're recording on Tuesday because it's Memorial Day on Monday. I think the fun part of next week's show is that we have Alan from TeleGeography, who is going to join us and talk about submarine cables, their evolution, and how they've been holding up during COVID.
Angelique Medina:
Absolutely, yeah. We chatted last week, so this should be a really interesting conversation. We'll probably talk about a number of things, not just submarine cables, but also some of the research they do and how they map all of this. So we'll see what happens this week, and then we can talk about it next week with Alan.
Archana Kesavan:
Thank you for joining us. We'll see you next week.
Angelique Medina:
Thanks everyone for joining us.