Major T-Mobile Outage + Uber Cloud Architecture Scale

Watch on YouTube - The Internet Report - Ep. 12: June 15 – June 21, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet, and why. On this week’s episode, we cover a widespread T-Mobile outage that took down its cellular network for several hours and elicited a rare condemnation from the FCC. The culprit, according to the carrier, was a fiber cut, highlighting the need for redundancy and resiliency in the nation’s cellular networks. We also cover an issue with WhatsApp’s privacy settings that sent users scrambling to Twitter, as well as a recent move by Russia to “un-ban” the messaging app Telegram. Then, stay tuned as we go one-on-one with Jason Black, the Head of Global Network Infrastructure at Uber Technologies, to discuss how Uber approaches its cloud architecture.

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Follow Along with the Transcript

Archana Kesavan:
This is the Internet Report, where we uncover what's working and what's breaking on the Internet—and why. Before we get into the headlines today, stay tuned to listen to Jason Black, who is the Head of Global Network Infrastructure at Uber Technologies. Jason is joining us later on in the show to talk about how Uber builds cloud applications for resiliency and redundancy. Onto the headlines right now.

Angelique Medina:
Yeah, the big story from last week was clearly the T-Mobile outage. This is a very large mobile carrier, third-largest in the United States. And they were impacted on Monday of last week, June 15th for nearly 10 hours. So this affected their LTE network and there was a lot of chatter on social media about it, and even the FCC stated that this was simply unacceptable that a service like T-Mobile would be down for that period of time.

Archana Kesavan:
We're going to cover a little bit more about the T-Mobile outage in our “Under the Hood” section. But other than the T-Mobile outage, there was a little bit of chatter going on about WhatsApp as well, and this was on Friday, June 19th. Although the service faced no disruption in terms of actually communicating with people, there was apparently a glitch that blocked users from updating their privacy settings. It's unclear why this caused any stir, because it did not really affect the actual platform itself, but it still made it onto social media.

Angelique Medina:
Yeah. And there was some other news related to a messaging application. There was … Russia lifted its ban on Telegram. Right?

Archana Kesavan:
Right.

Angelique Medina:
That ban had been in place for the past couple of years.

Archana Kesavan:
Yeah, so Russia has some strict anti-terrorism laws. It requires messaging services to actually provide the authorities with the ability to decrypt messages. And Telegram actually refused to comply with that, because of which they faced the ban, I believe in April of 2018. And as of last week, Russia lifted that ban on Telegram. Which is good to know, because in spite of the ban, people were finding different ways to actually use Telegram, through VPN services and so on.

Angelique Medina:
Great. Yeah, so we're just going to go straight into a little bit of a deeper dive on the T-Mobile outage last week. As we mentioned earlier, this affected the LTE portion of their network. And so, just to quickly cover what we saw during this period: we had some tests that were going through their network backbone, and they were all fine. They were not impacted by this outage. So it was really just the voice calls that were impacted. Right?

Archana Kesavan:
Right, right. Exactly. And a few of us have T-Mobile. That's our service provider for our phones, and we noticed that we couldn't make any phone calls over their LTE network. We couldn't send any text messages either. However, we were able to use T-Mobile's LTE service to get on Slack or any other type of data service. Now, it turns out that T-Mobile did come up with the root cause, where they said there was a fiber outage that kind of overwhelmed a very critical piece of their LTE infrastructure called the IMS Core. And just to provide a little bit of perspective there as to what a mobile network looks like, this is kind of a really high-level version of what a mobile network looks like. And if you're connecting over LTE, you're coming in from this E-UTRAN piece, passing through what's called an evolved packet core, which is a packet-switched network, going all the way through to the Internet.

Archana Kesavan:
Now, that is only the case if you're dealing with data, which is, say you're on LTE and you're using Slack or any other kind of data service. However, if you're making a voice call over this network, you're kind of hitting what's called the IMS Core, which is truly critical in terms of the signaling component of a voice call or a text message. And, according to T-Mobile, this was the piece of the network and infrastructure that was overwhelmed because of a fiber cut, which kind of explains why we couldn't make any phone calls over LTE or even send text messages over LTE.
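
As a rough illustration of the dependency described above, the following Python sketch models which LTE services rely on which network components. The component names come from the discussion; the `DEPENDENCIES` table and the `affected_services` helper are hypothetical simplifications for illustration, not a description of T-Mobile's actual architecture.

```python
from typing import Dict, Iterable, List, Set

# Assumption: a deliberately simplified dependency map based only on the
# hosts' description. Data rides E-UTRAN and the evolved packet core (EPC);
# voice and text additionally need the IMS Core for signaling.
DEPENDENCIES: Dict[str, Set[str]] = {
    "data": {"E-UTRAN", "EPC"},
    "voice": {"E-UTRAN", "EPC", "IMS Core"},
    "text": {"E-UTRAN", "EPC", "IMS Core"},
}


def affected_services(failed_components: Iterable[str]) -> List[str]:
    """Return the services that depend on at least one failed component."""
    failed = set(failed_components)
    return sorted(
        service
        for service, needs in DEPENDENCIES.items()
        if needs & failed  # any overlap means the service is impacted
    )


# Hypothetical scenario: only the IMS Core is overwhelmed.
impacted = affected_services(["IMS Core"])
print(impacted)
```

Under this simplified model, a failure confined to the IMS Core leaves data working while voice and text break, which matches what the hosts observed during the outage.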

Angelique Medina:
What's interesting, this sort of reminds me of the Comcast outage a couple of years ago, where there were two fiber cuts actually. And the result of that was that it severed parts of its control plane. And so they had connectivity in one part of the country and separately in another, but because their control plane was impacted, it had a really huge impact across the network. So fiber cuts, if they're impacting a really critical part of the infrastructure, and obviously the control plane is sort of the head, can have a pretty massive impact. So it's plausible, their explanation, in terms of the fiber cut potentially impacting a really critical part of their infrastructure, and then in turn that can have a cascading effect. Although, it's interesting that there wouldn't be more redundancy. In the case of the Comcast outage, apparently there was a fiber cut that was still in the process of being addressed and then they had a second fiber cut.

Angelique Medina:
So there wasn't a problem with the first one, it was the second one. So it's going to be interesting to see whether this is an example where there should have been more resiliency in place. But we don't have the full story, obviously.

Archana Kesavan:
Right, we don't. And based on T-Mobile's root cause, they did express that they should have more resiliency, not just on the fiber piece of it, but also on the IMS Core infrastructure. And that's something they mentioned that they're going to be working on.

Angelique Medina:
Great. All right. So you sat down with Jason Black of Uber. I'm really excited to hear your interview with him.

Archana Kesavan:
Yeah. Up next, Jason's going to be talking about how to design apps in the cloud, especially because a lot of the outages that we have been seeing over the past few weeks, like the IBM Cloud outage a couple of weeks ago, were kind of caused by an external third party disrupting applications in the cloud. So Jason has some really interesting perspectives on how you should think about it, what Uber does, and things like that. So up next is Jason Black from Uber.

Archana Kesavan:
Welcome to the Expert Spotlight. This week we have Jason Black. Jason is the Head of Global Network Infrastructure at Uber Technologies and covers datacenter, backbone, PoPs, and the cloud for the production and Advanced Technology Group networks within Uber. He is a technology and business visionary with hands-on experience in growing multibillion-dollar web-scale companies, as well as his previous startups. Jason, thank you so much for being on the show.

Jason Black:
It's a pleasure to be here. Thanks for having me.

Archana Kesavan:
Part of why it was interesting to have you on the show is related to some of the outages that we've been seeing recently. Last week on the show, we unpacked IBM Cloud’s outage, which was caused by an external network provider flooding routes into IBM's network. Now, a lot of these outages on the cloud and the Internet cannot be prevented, and in a way, it's kind of the price you pay for the agility and convenience of the cloud. So what should enterprises do to protect themselves from some, if not all, of the outages when deploying applications in the cloud?

Jason Black:
Good question. So just as a company would think about redundancy and resiliency for their on-prem environments, the same consideration should be made for their cloud environments. A company or a service should always avoid vendor lock-in by way of spinning up services that can be replicated in other cloud providers or on-prem. This is certainly easier said than done as you would imagine, given that each cloud provider has slightly different offerings. But it should still be kept in mind when making these decisions.

Archana Kesavan:
Right. And does that indicate multi-cloud is a best practice then?

Jason Black:
I think it really comes down to what you're looking for within your application stack and where you need things. Take databases, for example. Certain databases that can be had in one cloud provider can also be had in another, but they might be slightly different service offerings that you may or may not need and want. And you would also want to consider how you're going to strategize that with on-prem. Because if you ever have to fail out, you have to be able to support that business model.

Archana Kesavan:
So Jason, how do some of these redundancy and resiliency best practices factor into services that Uber builds?

Jason Black:
So Uber has a tripod strategy, which is publicly known, you can go out there and Google it. But how we do this is kept a little bit closer to the vest as you might imagine.

Archana Kesavan:
Absolutely.

Jason Black:
Yeah. Having said that, I can state, as previously mentioned, that Uber always looks to avoid being single-threaded to either on-prem or cloud.

Archana Kesavan:
What does that mean?

Jason Black:
So the way that we look at things, and we expect that our cloud providers do the same, is to have multiple availability zones in multiple regions. So where we may appear to be single-threaded by having, say, our data lake in an on-prem environment, we don't just have it in one zone or one region. We have it across multiple zones and multiple regions. And the same thing is what we would expect of our cloud providers. So say we talk about front-end SSL termination, so that the app can terminate to the cloud. We would expect that if that service were to fail, they would have a redundancy and resiliency model to allow us to continue that service within the cloud. However, if they don't, then we have a backup option for that, which is our on-prem.

Archana Kesavan:
Got it. Okay. So if I had to unpack this and kind of visually structure this, you have multi-cloud environments, so that's one level of redundancy that comes into play. Within cloud, you rely on the redundancy options they provide, which is availability zones and regions. So if a cloud region fails, you switch over to another cloud region.

Jason Black:
Correct.

Archana Kesavan:
But then you go one step further and then your own data centers are also created the same way, wherein they have the concept of availability zones and regions. You kind of almost have redundancy at every layer possible.

Jason Black:
That is correct.
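
The layered redundancy summarized in this exchange can be sketched as a small placement model. This is a minimal sketch under stated assumptions: the `Site` type, the `pick_site` helper, and the site names are all hypothetical, and nothing here reflects how Uber actually implements failover.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Site:
    provider: str  # illustrative labels such as "cloud" or "on-prem"
    region: str
    zone: str


def pick_site(sites: Iterable[Site], down: Set[Site], preferred: Site) -> Optional[Site]:
    """Return the healthy site closest to `preferred`, or None if all are down."""

    def rank(site: Site) -> int:
        if site == preferred:
            return 0  # the preferred zone itself
        if site.provider == preferred.provider and site.region == preferred.region:
            return 1  # another zone in the same region
        if site.provider == preferred.provider:
            return 2  # another region with the same provider
        return 3      # a different environment altogether

    healthy = [s for s in sites if s not in down]
    # min() keeps the first site among equals, so list order breaks ties.
    return min(healthy, key=rank, default=None)


# Hypothetical inventory: two cloud regions plus two on-prem regions.
SITES = [
    Site("cloud", "region-1", "a"),
    Site("cloud", "region-1", "b"),
    Site("cloud", "region-2", "a"),
    Site("on-prem", "region-1", "a"),
    Site("on-prem", "region-2", "a"),
]
PREFERRED = SITES[0]

region_down = {s for s in SITES if s.provider == "cloud" and s.region == "region-1"}
cloud_down = {s for s in SITES if s.provider == "cloud"}

print(pick_site(SITES, region_down, PREFERRED))  # falls over to another cloud region
print(pick_site(SITES, cloud_down, PREFERRED))   # falls over to on-prem
```

The ranking mirrors the order described in the conversation: another zone first, then another region, then a different environment.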

Archana Kesavan:
So, if we had to walk through a scenario of an outage, for instance... And I think you alluded to this, like the SSL service, which is kind of the front end where a user would come to. If that service fails in the cloud, you try to go to another region. But then if all those regions are out for some reason, what does the workflow look like? And I guess more importantly, how does this affect the end user?

Jason Black:
Sure. So just going back, if you're referring to the cloud provider not being able to provide that SSL termination for us, our application has been built for that type of awareness, and this failure still would allow for our application to be self-aware and go ahead and make a direct call to our data centers. And that would alleviate having that front end terminate to the cloud. All of this is completely transparent to the end user and happens within milliseconds. So our application goes ahead and does those initial calls to the cloud, and then immediately when it sees that it's not answering, straight to the data center.

Archana Kesavan:
You kind of bypass the cloud itself if there is an outage.

Jason Black:
Absolutely.
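
The client-side behavior described here, trying the cloud front end first and going straight to the data center when it does not answer, might look something like the sketch below. It is an illustration only: `call_with_fallback`, `fake_send`, the endpoint names, and the timeout value are all assumptions, and Uber's real implementation is not public in this conversation.

```python
from typing import Callable, List, Sequence, Tuple


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the ordered list answers."""


def call_with_fallback(
    endpoints: Sequence[str],
    send: Callable[[str, float], str],
    timeout_s: float = 0.2,  # assumed short timeout so failover stays fast
) -> Tuple[str, str]:
    """Try endpoints in order; return (endpoint, response) from the first that answers."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return endpoint, send(endpoint, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            # Record the failure and move on to the next option.
            errors.append(f"{endpoint}: {exc!r}")
    raise AllEndpointsFailed("; ".join(errors))


def fake_send(endpoint: str, timeout_s: float) -> str:
    """Stand-in transport that simulates every cloud front end being unreachable."""
    if endpoint.startswith("cloud-"):
        raise TimeoutError("no answer")
    return f"ok from {endpoint}"


# Hypothetical ordering: cloud regions first, then the company's own data center.
ENDPOINTS = ["cloud-region-a.example", "cloud-region-b.example", "dc.example"]

used, reply = call_with_fallback(ENDPOINTS, fake_send)
print(used, reply)
```

Passing the transport in as a function keeps the ordering logic separate from the network call, so the failover path can be exercised without a real outage.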

Archana Kesavan:
Okay. That's awesome. And then the reverse of that, where your cloud is working okay but your data center and redundancy there fails for some reason, how does that impact any workflow?

Jason Black:
Well, I'd like to think that we've done our best to design a network that's resilient to failure, taking into account having some sort of distribution and replication of workloads across our availability zones and our regions. But again, these types of failures or outages remain transparent to the end user. And it's not really a flip of a switch, per se. It's application awareness. It's just the way that our software stack has been built to be able to handle these types of failures or outages.

Archana Kesavan:
So it goes not just at the network and the infrastructure level, but also at the app level there is redundancy and resiliency built into it.

Jason Black:
That is correct.

Archana Kesavan:
Okay. Yeah, that makes sense. One final question, Jason, before we let you go: despite all the redundancy and resiliency best practices that you have in place, what still keeps you up at night when it comes to cloud deployments?

Jason Black:
Let's see, that's a great question. So I think we've all heard the adage that your cloud is somebody else's computer, and in this case the data center, right?

Archana Kesavan:
I should have worn that T-shirt. Actually, we have a T-shirt that says, "Your cloud runs on my network."

Jason Black:
I happen to have many of your T-shirts and I appreciate them all. But most don't question what's going on in these data centers. We do our research as companies to try to know what regions they're deployed in, what their zone diversity looks like. And you want to assume that they're following best practices, from monitoring all the way through the triage of incidents. I can only control what happens in my data center, but unfortunately I can't control what happens in others. So we have to have faith in the way that things are being built, both on-prem and in cloud. And as you alluded to before, we are following exactly what the best practices are for ourselves, and we assume that others are doing the same. So whether it's on-prem or cloud, it should be a seamless interaction for the end user.

Archana Kesavan:
Well, Jason, thank you so much for being on the show. It was a pleasure hosting you, as always.

Angelique Medina:
Thanks, Archana and Jason, that was a great discussion. A lot of really great points, including the fact that most IT professionals today have to deal with a lot of infrastructure that underlies their services, and they don't own much of it. So you don't own service provider networks, you don't have control over it. Same thing with cloud networks and a lot of infrastructure. Again, if you're in the cloud, you don't own the underlying infrastructure. But you're still responsible for the performance of the services you deliver. So there's some good lessons in there on still needing to understand and get eyes on all of the different pieces and all of the underlying dependencies for services that you're offering.

Angelique Medina:
So with that, we're going to go ahead and close out the show. Thanks for listening in. We have a great virtual summit, State of the Internet, coming up in a few weeks. The registration page for that event is now live, so you can go to thousandeyes.com/events and sign up to attend. We have great speakers from Fastly, CenturyLink, APNIC, Akamai, and more. So really excited about that. And of course, as always, if you have any questions, you want to offer any suggestions in terms of topics or speakers, you can drop us a note at InternetReport@thousandeyes.com, and don't forget to hit subscribe and follow us on Twitter. With that, take care.

Archana Kesavan:
Take care.
