

Own or Be Owned: Monitoring Networks You Don't Control

By Alex Henthorn-Iwane
10 min read


At ThousandEyes, we have many web enterprise and other digital business customers, but we have also seen rapid uptake of our technology by mid-sized and large enterprises, with over fifty of the Fortune 500 and 90+ of the Global 2000 now customers. In my observation, the common thread behind why customers across the spectrum are so keen on our modern take on network monitoring (which we call Network Intelligence) is that it answers this question: How on earth do you monitor networks you don’t own or control?

Why Digital Businesses Need More Than Passive Data

For any business that relies primarily on Internet traffic delivery for revenue, it’s natural to think about networks outside of your control. For most digital businesses, that concern leads to measuring the outbound traffic delivered from data centers via various ISPs to customers in various regions. You can use passive data like NetFlow, sFlow and IPFIX, paired with BGP data, to build this picture. The typical use case for this type of data collection is managing peering and transit commits at your Internet edges.

But what about understanding the impact of external networks and services on digital experience from your customers’ point of view? Your users rely on a domino chain of providers such as DNS, CDN, DDoS mitigation and ISPs to get to your site. How do you know how these providers are performing? Since all those providers are multi-tenant and always sit in your users’ path to your digital front door, what happens to your performance if some other customer of your DDoS mitigation provider is getting hammered by an attack? Will your users suffer too? What if DNS response time goes from 20 ms to 250 ms? Who do you escalate to when the problem is happening somewhere “out there,” and with what data? You need to know the answers to these questions from the perspective of every geography that your users are in. However, you can’t use passive network monitoring data to answer them, for the simple reason that you can’t collect flow, pcap or SNMP data from networks you don’t control. Even APM, as wonderful as it is, has limited ability to give you root cause insight when the problem lies outside your own code and infrastructure.
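
As a rough illustration of the DNS question above, here is a minimal Python sketch of an active check: it times a lookup from wherever it runs and compares the median of several samples against a baseline. The function names (`time_dns_lookup`, `is_degraded`), the 20 ms baseline and the 5x factor are assumptions made for this example, not part of any ThousandEyes product.

```python
import socket
import time
from statistics import median


def time_dns_lookup(hostname: str) -> float:
    """Resolve hostname through the system resolver; return elapsed milliseconds.

    Caveat: the OS resolver may answer from cache, which can hide upstream
    slowness. A production probe would query the provider's servers directly.
    """
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 443)
    return (time.perf_counter() - start) * 1000.0


def is_degraded(samples_ms, baseline_ms=20.0, factor=5.0):
    """Return True when the median sample exceeds factor x baseline.

    baseline_ms and factor are illustrative assumptions; tune them per provider.
    """
    if not samples_ms:
        raise ValueError("need at least one sample")
    return median(samples_ms) > baseline_ms * factor
```

Calling `time_dns_lookup` a handful of times for your own hostname and passing the results to `is_degraded` gives one vantage point's view. The value comes from running the same check from every geography your users are in.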

This is a lesson that a major banking customer learned the hard way when users couldn’t access their online banking site on multiple occasions. Their APM tool (albeit awesome; we are all in favor of APM) couldn’t give them any insights because the problem wasn’t the website code or any internal infrastructure. Cue the hundred-person war room calls at about $50K per hour (!!), and a dozen people breathing down the necks of network engineers who keep telling them that they don’t have any data to answer their questions, while nobody seems to be listening (sound familiar?). This is an example of getting owned by the multiple dependencies in the Internet when you don’t have any visibility.*

Then There’s SaaS, Where You Don't Even Own The Code

Let’s take this from a different angle, where you’re not running production for a web enterprise or digital business division. You’re the network ops team serving a hundred branch offices, and you’re rolling out an SD-WAN or Direct Internet Access (DIA) to various SaaS applications like Office 365. Between your branch and the Office 365 data center, you have a few ISPs plus possibly a cloud-based secure web gateway (SWG) provider. We’ve already established that your traditional passive network monitoring gives you zero visibility into these networks. On top of that, you don’t own the code, so APM code injection is also out.

So what happens when a project manager rolls out SharePoint, where all users connect to a single data center, thinking that it works the same as Exchange Online (where you connect to a CDN)? We’ve seen customers struggle with this scenario multiple times: massive latency from various sites around the world, howls of user pain, then confusion, consternation, etc. Is it our network? Is it one of the many ISPs involved? Is it the SharePoint app? Is it a cloud-based SWG like Zscaler?

Get Active, Now

When you can’t effectively collect passive data, you must evolve network monitoring to include active techniques. That’s a foundational notion for our modern version of network monitoring, which we call Network Intelligence. We’ve discovered that there are a few key requirements for network monitoring in Internet-centric scenarios:

  • You need the perspective of every user. A user can be a human sitting in a branch office, home office, or remote location. A user can also be a microservice sitting in a data center or in AWS, Azure or GCP.
  • App and network are equally important. When you’re dealing with networks and services that you don’t own or control, you need both sets of data.
  • Automated, visual correlation. One-to-one performance indicators analyzed in isolation aren’t helpful. You need all those app and network measurements analyzed together. And you don’t want your Ops team doing that in their heads, in Excel or on paper. Correlative analysis needs to be automated and visual to be helpful.
  • Shareable. You need to use this sort of visual data to get both internal and external teams to act together to solve problems and optimize service delivery.
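
To make the correlation requirement concrete, here is a minimal Python sketch that looks at app and network measurements together for each vantage point and produces a first-pass hint about where to look. The `Sample` fields, the thresholds and the labels are all assumptions chosen for illustration; they are hints for triage, not a root cause verdict, and not how any ThousandEyes product works internally.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    location: str      # vantage point, e.g. a branch office or cloud region
    http_ms: float     # application-layer response time
    loss_pct: float    # packet loss measured along the network path
    latency_ms: float  # network round-trip latency


def classify(sample, http_limit=1000.0, loss_limit=2.0, latency_limit=150.0):
    """Combine app and network signals into one triage label (limits are assumed)."""
    slow_app = sample.http_ms > http_limit
    bad_net = sample.loss_pct > loss_limit or sample.latency_ms > latency_limit
    if slow_app and bad_net:
        return "network-suspected"
    if slow_app:
        return "app-or-server-suspected"
    if bad_net:
        return "network-degraded-no-user-impact"
    return "healthy"


def summarize(samples):
    """Group vantage point locations by triage label."""
    report = {}
    for s in samples:
        report.setdefault(classify(s), []).append(s.location)
    return report
```

A slow page load paired with a clean network path points toward the app or its servers, while a slow page load paired with loss or high latency points toward the path. Seeing both per location is what a single indicator cannot give you.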

You may not own many of the networks, apps or services that your business depends on, but that doesn’t mean you escape responsibility for service delivery. Network Intelligence helps you regain control and own your destiny. If you want to learn more, request a demo, or if you’re ready to get your hands on this sort of visibility, start a free trial.

* P.S. We helped them out and they found the root cause of the problem in minutes—it was a reachability issue between their CDN and that CDN’s transit provider.
