
The Internet Report

Inside Distributed Monitoring Infrastructures

By Barry Collins & Mike Hicks
41 min read

Summary

Get real-world insights on the operational challenges and technical strategies involved in building, validating, and expanding a global monitoring infrastructure.


This is The Internet Report, where we analyze outages and trends across the Internet through the lens of Cisco ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re taking a break from our usual programming for a conversation with Brennan Hildebrand, Senior Manager for Engineering Operations at Cisco ThousandEyes, about distributed monitoring. As always, you can read more below or tune in to the podcast for firsthand commentary.

Distributed Monitoring: Seeing the Internet from Every Angle

If you’ve ever wondered how global organizations maintain meaningful insight into their network performance, or what it takes to run a fleet of monitoring agents across thousands of locations, you’re in for a fascinating look behind the scenes.

In this episode of The Internet Report, our special guest Brennan Hildebrand shares first-hand insights on the operational challenges and technical strategies involved in building, validating, and expanding a global monitoring infrastructure. From handling agent orchestration across diverse ISPs and cloud environments to troubleshooting unexpected issues like DNS misconfigurations, Brennan’s experiences reveal the complexity—and necessity—of distributed monitoring. We’ll discuss:

  • Why multi-location monitoring matters: Learn how visibility from diverse networks can help uncover blind spots and detect regional outages, invisible failures, and performance degradations impacting end users.

  • Debugging methodologies: Understand how distributed agents help pinpoint root causes of network issues for faster resolution.

  • The value of collaboration: Find out how vendor relationships can influence operational workflows and monitoring effectiveness.

  • Evolving monitoring demands: Hear how application architectures, generative AI, and mega data centers are reshaping monitoring requirements and agent deployment strategies.

To learn more, listen now and follow along with the full transcript below. For additional insights on distributed monitoring, explore this blog post.

A Conversation on Distributed Monitoring

BARRY COLLINS: Welcome to the show, Brennan. Why don’t you start by telling us about your background and how you ended up leading Cloud Agent Operations?

What's your day-to-day role look like, and what does your team actually do?

BRENNAN HILDEBRAND: Well, I've been in the industry for quite a while. I was a software engineer by trade for most of it, but for the last eight years or so, I moved into leadership. That was while I was at AppDynamics, and I came to Cisco through acquisition in 2017.

So, that's when I started my journey at Cisco. I was at AppDynamics for seven years, so four years at Cisco. Then I started looking around at different things. I wanted something new, some new challenges. I saw this role at ThousandEyes, which basically ticked all the boxes that I was looking for. It's SRE, leadership, an interesting problem to solve with lots of little fine detailed problems to deal with, plus a group of people that are just really collaborative and awesome to work with. And so I've been here for about three and a half years now, leading the cloud agent operations team the whole time. I was drawn by interesting tech, global visibility, edge monitoring. I mean, there's so many cool things involved in this project.

Day to day, I'm involved in planning some scrum management and a little scrum master action for our sprints every couple of weeks. Cross team comms—you know, I work with other teams quite a lot, both in operations and outside of operations. Incident management, handling all of that stuff. Vendor management, which is a big deal for us.

The team as a whole, we’re working on automation for... well, we've got about a little over a thousand locations, and a bit over 10,000—maybe 11,000 by now—actual agent instances that we manage.

So, we do upgrades, and we expand the fleet to new locations as we believe our customers will want or our customers have requested. We try to be open to any customer requests at all. And incident response, of course, as I mentioned earlier.

We're also actually branching out from the cloud agents a little bit. We're taking on SRE support for our device agent, which is a fairly recent product that we put out there, and also the BGP monitoring stuff that we're really trying to grow. So we're trying to take sort of a similar model with that. We don't have them in a lot of locations yet, but we're trying to grow that out basically to the world. So yeah, that's pretty much what we do.

BARRY COLLINS: So talking of monitoring, why isn't it sufficient to just monitor from your own data center or a handful of locations? What specific blind spots emerge when you only have visibility from limited vantage points?

BRENNAN HILDEBRAND: Limited monitoring is limited visibility. If you're only doing, you know, your DC or a group of cloud providers even, that's a lot of locations, but it still isn't really enough. If you're just offering a service that's got API calls, then cloud providers might be all you need to look at. But if you've got a website that's got a front end that wants eyeballs, things like that, you want broader visibility.

You do want to focus on the areas where you expect your customers to be, but those areas are generally very large. The internet touches a lot of places and people are going to come from wherever. With that, if you keep it light, you're going to miss regional issues. You know, maybe your site's running great and most of your customers are having great access to it, but you've got a set of customers that are really having problems, and we can give you visibility into whether or not it's the network that's doing that.

There are routing issues that you're not going to catch. Maybe it's just a slowdown for some of your customers. Are the routing tables sending you through an ISP somewhere that is maybe not really the best path for your traffic? And so it really gives you those advantages. You get real user paths out of this. We hit a lot of the major metros, of course, but we're trying to do smaller cities, smaller metropolitan areas as well. So, you really get to see the path all the way from those locations, your customers, to your service or your application. And that's something you just don't get if you keep it close.

BARRY COLLINS: Mike, from your perspective, why does having agents distributed across different provider networks help with troubleshooting?

MIKE HICKS: The core question is this: Is it just me?

So, from that single vantage point, you're going to see, as Brennan said, just your network's view only. We've got multiple providers in the same city with different ISPs. I'm going to want to understand if it's ISP specific or is it a systemic issue.

Providers sort of route things dramatically differently. It might be the same target, but they're going to take different paths depending on which ISP. One might have peered directly with, say AWS, and have great performance, where another one doesn't. One might transit through three different networks, whereas another one goes sort of straight in there.

So, you know, over the period we've seen sort of these “invisible failures.” We've seen outages where BGP routes stay advertised, but the traffic was black holed. And then externally, everything looks fine. There were no route withdrawals, but the customers from that network couldn't actually get anywhere beyond that path, so we needed to understand what's going on.

And then you have the concept of CDNs and the DNS, they're going to behave differently by the source. So, the CDNs can select different servers based on the source network. The DNS is going to give different answers based on where the query is actually coming from.

Ultimately, this then helps us in this troubleshooting design tree. Are all agents failing? Therefore, it's a target problem.

Are we only seeing within one particular ISP? Then it's that ISP routing issue.

Are all agents everywhere having the issue? We need to put it down to a service.

Looking from one individual vantage point, it's impossible to determine, which is why you need these distributed vantage points.
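Mike's decision tree can be sketched as a small classifier over per-agent results. This is a hypothetical illustration, not ThousandEyes tooling; the field names and categories are assumptions made for the example.

```python
# Hypothetical sketch of the troubleshooting decision tree: classify an
# incident by which vantage points observe the failure. Field names and
# verdict strings are illustrative only.

def classify_failure(results):
    """results: list of dicts like {"isp": "ISP-A", "ok": False}."""
    failing = [r for r in results if not r["ok"]]
    if not failing:
        return "no issue observed"
    if len(failing) == len(results):
        # Every agent, on every network, sees the failure: the target itself.
        return "target problem (service or destination down)"
    failing_isps = {r["isp"] for r in failing}
    healthy_isps = {r["isp"] for r in results if r["ok"]}
    if len(failing_isps) == 1 and failing_isps.isdisjoint(healthy_isps):
        # Failures confined to a single provider's network.
        return f"routing issue within {failing_isps.pop()}"
    # Mixed signal: some agents on several networks affected.
    return "partial or regional degradation: investigate further"
```

With distributed vantage points the same comparison happens visually on a dashboard; a single vantage point gives this function only one row to work with, so no branch can ever be distinguished.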

BARRY COLLINS: Brennan, you're managing agents across ISP networks, cloud providers, data centers, mobile edge. What are the operational challenges that aren't obvious until you're running this day-to-day?

BRENNAN HILDEBRAND: There's quite a few of them. One big one is because of the diversity of ISPs and where we run these agents, you know, everybody is a little bit different. We have to take different strategies for managing each one. We generally don't control the low level infra, so we can't do PXE boots, things like that.

We have to rely on our vendors to do the OS installation, especially on bare metal. It's a lot easier with cloud providers, of course, but OS installs can be kind of tricky. DHCP is tricky. We get IP blocks from all these places, and managing those is a pain, you know. We manage our own DNS. It's got as many entries as we have agent instances. We're managing over 11,000 at this point. So that's quite a challenge.

And the diversity also creates exceptions in our automation. We have to handle the differences between them and then code it up so it's the same interface, basically, for the tooling. It just knows how to deal with this or that or the other thing.

Plus, we use Ansible for our bare metal stuff, for doing deployments and maintenance and management. But for our cloud providers, we use Terraform. So there's two separate paths that we have to go for that. But there's a lot of similarities once you've got the infrastructure down between the two, so it's an interesting balance.

Then fleet wide ops are a lot of work. These are especially important for vulnerability management, for making sure that the CVEs are covered quickly and we're getting stuff up. So, we need to be able to do rolling restarts across 11,000 different instances. Patches need to go out to everybody. Updating the agent needs to go out to everybody. And while, you know, it seems atomic, it's not. The fleet operation is 11,000 separate little things. Managing the state of all of those is quite complicated as well.
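The "11,000 separate little things" point can be made concrete with a minimal sketch of a batched rolling restart that tracks per-instance state. The function names and batch size are assumptions for illustration; the actual ThousandEyes tooling is not public.

```python
# Illustrative sketch of a batched rolling restart across a large agent
# fleet. restart_fn and batch_size are hypothetical stand-ins.

def rolling_restart(instances, restart_fn, batch_size=100):
    """Restart instances in batches, recording per-instance outcomes so a
    partially failed run can be resumed rather than started over."""
    state = {}
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for inst in batch:
            try:
                restart_fn(inst)
                state[inst] = "ok"
            except Exception as exc:
                # Record the failure and keep going: one unreachable host
                # must not stall the rest of the fleet.
                state[inst] = f"failed: {exc}"
    return state
```

Even in this toy form, the operation is a loop over thousands of individually fallible steps, which is exactly why the state tracking, not the restart itself, is the hard part.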

And we've got some tooling at the base of it, Netbox is actually what we use. But even Netbox is difficult because it's really designed for a data center's use. And so, some of the objects don't really align to our use case, although it's been getting a lot better recently, which is really nice.

We have to align the platform with what's going on in the field and make sure everything's going well. It's a lot of work.

BARRY COLLINS: I’ll bet it is. When an agent starts behaving unexpectedly—DNS resolution spikes, packet loss increases—what's your debugging methodology? How do you figure out if it's the agent, the provider's network, or something upstream?

BRENNAN HILDEBRAND: We usually begin by taking a look at our own data to see what we've collected around that. We look at, you know, past tests. We create a test, we create baselines on these things, so we can have a good feel for whether or not it's outside the threshold.

And we start using that data to look at it: Okay, well, what's going on here? With heavy packet loss, is there some place on the way where, you know, our flow that we've got in our dashboards just ends? And so we do that, but lower than that, we use some pretty standard network tooling, traceroute and things like that. We've got our own sort of modified traceroute stuff, of course, but we use a lot of the standard network debugging tools really to figure it out.

And if it's unclear, we'll start looking at us first, to see if we're to blame—we want to eliminate that as fast as possible, because if it is us, we want to get on it and fix it, you know, boom, immediately. It's, you know, anything that affects customers is just top priority. All hands on deck, everything else gets dropped.

So yeah, so again, we use our own tooling for this, we dig into it and basically trace it farther and farther out until we can get to the end and find what's going on—if it's not obvious. I mean, thankfully we've got the ThousandEyes tool because it makes a lot of it obvious. It's really helpful.

BARRY COLLINS: When you were testing AWS Wavelength Zones, to ensure the network quality was in line with what Cisco ThousandEyes customers expect, you discovered a Route 53 DNS misconfiguration during pre-deployment testing. We’ll link to the blog post covering this in our show notes—but walk us through what that validation process looks like, what you tested, and how long a new agent location sits in validation.

BRENNAN HILDEBRAND: Generally, the minimum is a week. For emergency situations—say we lost an ISP and are trying to get that location back up—we'll shorten the validation time a little bit just to get it back to our customers. But generally, we do two weeks of validation.

We run agent to agent tests, page load tests. We look at packet loss, latency, jitter, routing, DNS, SSL, various wait times, and we take that data and we compare it to known good baselines. So we've got a set of locations, both for page load tests, with websites that we know are very solid. We've used them over the years. We run tests on them all the time, so we know that they're good clean locations.

And for the cloud agent stuff, we'll do basically agent to agent tests, where we've got a set of known good cloud agents that we can use, so it's very similar. We create baselines from those and we look at it and see how they're doing, basically. And if they pass all the tests for two weeks, boom, they're out there in the wild.
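The baseline comparison Brennan describes can be sketched as two small steps: build a baseline from known-good agents, then flag any metric where a candidate location drifts past a tolerance band. The 20% tolerance and the metric names are assumptions for the example, not ThousandEyes internals.

```python
# Hedged sketch of validation against known-good baselines. Metric names
# and the 20% tolerance are illustrative assumptions.
from statistics import mean

def build_baseline(samples):
    """samples: {"latency_ms": [observations...], ...} from known-good agents."""
    return {metric: mean(values) for metric, values in samples.items()}

def validate_location(candidate, baseline, tolerance=0.20):
    """Return the metrics where the candidate exceeds baseline by > tolerance.
    An empty dict means the location passes."""
    failures = {}
    for metric, base in baseline.items():
        observed = candidate[metric]
        if base > 0 and (observed - base) / base > tolerance:
            failures[metric] = (observed, base)
    return failures
```

Run this continuously over the two-week window and a location only "goes out into the wild" once it returns an empty failure set throughout.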

MIKE HICKS: Going back to when you were talking about understanding the issues and customers coming first: from a practitioner's perspective, that baseline is incredibly important to me, because when I'm looking at the data, I don't want to be second-guessing. I want to be able to say the data I'm seeing there is the issue I'm actually looking at—as opposed to “Is this some sort of echo that's coming from the agents themselves?”

So, the fact then that during that validation process, you bed that in with that baseline is good, because that gives us something to go from. That's the starting point: This is the best it can possibly be. Then we put the tests on top of that and now, I'm actually looking at the conditions, the actual user experience itself.

BRENNAN HILDEBRAND: Yeah, for sure. Absolutely. Very true.

The AWS Wavelength thing was a little bit different. That took us a long time because we set it up for the two-week testing and we didn't like what we were seeing. It looked wrong, and we did our standard thing. We looked at us first, tried to see if there was anything in the way we were doing it that might cause the slowdown and the packet loss we were seeing, until we finally got to DNS, with Route 53, and started digging into their configuration and discovered that there was a misconfiguration there. We communicated that to AWS, they fixed it, and we released the agent.

So, it was a lengthy process for that one, but a really good one. I mean, it was great because it proved our methodology. It also helped AWS with their product to make sure that it was working. It was fairly new to them as well as to us, at the time. So, they were looking for all of this. And for anybody that uses AWS Wavelength, this find is going to help them as well.

BARRY COLLINS: The AWS Wavelength story involved working directly with AWS. How often do provider relationships become part of your operational workflow? And what does that collaboration actually consist of?

BRENNAN HILDEBRAND: Our provider and vendor relationships are very important. We want to stay on their good side, certainly, because they're providing us a service pretty well. Some of it depends on the vendor themselves or the provider themselves. Some of them are more communicative than others, but we subscribe to all their outage reports, any kind of feeds they've got that we can get that kind of information from.

But with an organization like AWS, we've got a much closer relationship with them. They're in our Slack, even. Some of that is not necessarily because of the cloud agents themselves. We use them a lot across operations, but it's really helpful when we come across something like this. It's very collaborative. We worked with them very closely on the Wavelength stuff. They were kind of excited to do it. They wanted that kind of visibility there for that. So that was really nice. It was great having that collaboration with them. It's less close for some of our other providers.

And then, you know, some of our bare-metal vendors, we’re really close. Again, we've got them in Slack as well, and we can get to them as soon as we see something. And some of our vendors are even good enough to catch stuff before we do. First of all, they're going to tell us if there's any upcoming maintenance. But even like if an unplanned maintenance comes up, they're going to be hitting us up about the same time that we're finding out ourselves. It's great to have that relationship because that conversation has already started with them when a problem arises. That's really powerful.

MIKE HICKS: With the AWS one it's really interesting because this was again a real-life use case where effectively, AWS at that point became the customer and we were helping them to deliver a service. There was a performance degradation on Route 53, which was a service that they were trying to offer out through the Wavelength stuff. And then, using the solution, you're able to go through and troubleshoot this and get down to there. So that then became, as you said, this real collaborative [relationship]. They can then use that going forward, right? So, we've now optimized a service and we can continue to use the data to continue to optimize that service—were there things that we learned from that that we could then build into our mitigation plans going forward?

BRENNAN HILDEBRAND: It depends on the nature really of the outage. I mean, sometimes an outage is a fiber cut and it's just all gone. But wonderfully our dashboards will go “boof!” You know, they'll show when traffic was lost. You're right. We don't go offline, you know. The agent goes offline but the product does not.

MIKE HICKS: Yeah.

BRENNAN HILDEBRAND: And we continue attempting to monitor it. We ping the agent, and we get a lot of information from our vendors. They're usually the first ones to figure out exactly what's going on. Then of course, in the large outages, we also do follow-ups—I think on this podcast, as a matter of fact, which is a great thing.

And, as far as mitigation goes, for some of those things, it's pretty difficult to mitigate anything, I mean, things that are sort of within our control, we certainly can, but a lot of them are not. From my perspective, given my years in the industry, I'm amazed how stable the fleet is. The majority of our problems are something downstream from us or upstream from us, I guess. Yeah.

MIKE HICKS: Despite my youthful looks, I'm actually years into the industry, so I tend to agree with you there. But it's interesting what you say about the nature of the outage. Because again, from a practitioner's perspective, this tells me what's going on. If I lose this agent or I don't have communication from here—like you say, if it's a fiber cut—that gives me a very distinct pattern. I can see it's hit here, here, and here. If it's a regional failure within a particular cloud provider, I can see that and to what level my agents are performing or not performing.

And don't forget that because we have these distributed agents, I can still see this from other places. I can see region A in this location is having a problem, and I know it's isolated to that because I can't see it across any other agents. So even these failures, to me, tell my story—which then comes back to that baseline. If I'm confident in my baseline and confident that my agent isn't putting anything on top of that, then I can be confident that the signal is telling me what’s happening from there. And then I can go through and diagnose it.

BRENNAN HILDEBRAND: Absolutely. For sure. And yeah, I mean, that's a great point, because if you've got other tests that maybe aren't hitting that site, they may still route through that location, that region, what have you. And so in other tests, you can actually see the route change to get around it, you know, on the Internet, which is from my view, a very cool thing. Yeah, definitely.

MIKE HICKS: Absolutely, absolutely.

BARRY COLLINS: Mike, you've talked about how application architectures are evolving, more distributed, more API dependencies, agentic systems. Why does this make distributed monitoring more important and what's changing about what needs to be visible?

MIKE HICKS: If we go back in time, you know, what used to be visible: we used to own the entire stack. Therefore, I could instrument everything around there. And because of that, dependencies were internal and therefore predictable. We could see what was happening. We had a single network path to monitor.

Now I come from the days of even mainframes where we had this end-to-end connection from an SDLC perspective. That was the only bit I needed to monitor. I could see just that one part. So I could see everything was going around there.

What's changed, and why we needed this distributed architecture to monitor this environment, is that we have these interdependencies, these third-party systems that you no longer own or don't necessarily have control over, and because of that you can't instrument them. So I can't actually put agents on there, I can't sort of see what's happening, so I need some sort of external validation, because these dependencies are critical to my service delivery chain, and these dependencies themselves are going to cross multiple provider networks.

Just because we're utilizing one provider doesn't mean that this service we're using as part of ours is actually on the same provider as well. And then let's chuck agentic into there. This is going to create these unpredictable paths. I don't know where I'm going to need visibility. That visibility, again—for these third-party dependencies or the calls that the agents are making or the tools that they're selecting—might be on different provider networks, different environments, different cloud providers, even.

This is why I need these distributed vantage points. As we said right at the top of the show, when we're talking about the single vantage point, if I'm actually looking from one place, that is the only picture I'm going to see. I'm only going to see: “Is it impacting provider A?” If it happens to be, as Brennan said, in some downstream transit provider that it will actually route through, I'm not going to see that, because of the peering relationships and the connections I happen to have. And then, as we talked about, if there's a failure within a region or a data center itself, I'm not necessarily going to see that. Or in fact, I'm going to be lights out. If I lose my data center where I'm actually monitoring from, I've got no other visibility.

So, I need to be able to sort of see these cross-boundary issues from multiple points at any one time. If I look at the old perspective, I was really going, is my application working? But now, I actually need to understand: Is my application working for users, on a provider, who are in Dallas? And is it from this provider that's in Chicago? Or can I actually reach it from this cloud provider that's based in Singapore—because all these are going to have different characteristics.

From where I am, I'm connected by a bit of wet string, and if the sun comes out, it dries the string out. So my performance is going to be completely different to Brennan's performance. I need to understand that, because we have this situation now where a degradation can actually impede performance. Application availability is effectively table stakes. It's now about, how does it perform at that point there? And it's no good for me as a SaaS provider, as it were, to say, yeah, everything's good from here because it worked from my desktop. My users could be distributed anywhere around the world. And it could be mobile. We have this dynamic environment. Seeing it from these multiple points and being able to effectively triangulate what's going on, that's the major lesson I take from here.

BARRY COLLINS: From what you're seeing operationally, Brennan, how are these changes in application architecture affecting what you need to monitor and how you deploy agents?

BRENNAN HILDEBRAND: Well, first I'll take the deployment part. So far, we're not sure yet, basically. You know, we're, sort of sticking with our standard methodology, but if there's something different that we have to change, we will. As far as what we monitor, with the growth of genAI, it's a rocket ship and everybody's adopting it some way or another.

We really like to get visibility into these new mega DCs where this compute is happening. Because everybody's going to have all these customers that are making these calls into these MCP servers. And they're going to want to know, how good is the network quality in between? Is it a problem with the genAI that they’re probably paying through the nose for, or is it the network? Can we blame the network on this? That's really important.

The mega data centers that they're building are something that I've got in the back of my head as something we want to keep track of, and see where they're going and at least get close to, if not inside of. But even other groups might be running some AI stuff where their MCP is local but it's making calls into there. So customers are calling into their MCP. It's calling the big mega data center. We want to see that whole thing. You know, we want to be able to illuminate that entire traffic pattern so you can see where things are going wrong and if it's the network.

There are some pretty cool things ahead, I think. Exciting times we’re living in.

BARRY COLLINS: From working with network monitoring data and distributed systems, Mike, what's one thing about how the internet actually behaves that people often misunderstand?

MIKE HICKS: The biggest misconception that I think of is that the Internet is one big thing. It’s not. It's thousands of independent networks and they've all got their own routing policies and peering relationships and infrastructure.

Then the other thing you have to consider from there is that we have these invisible failures. The routes can stay advertised while the networks are completely down and we can't get to them. This is because we have these multiple networks all put together.

The other thing that I think is part of this misconception is that everything works on the basis of “how can I get there?” Although we have routing tables within these devices, no single device knows the entire path. I can't actually troubleshoot this from the routing tables alone. I actually need to test and validate this from different networks to see what's really happening.

The question really isn't: Is the internet working? The question is: Is it working on this network?

BARRY COLLINS: Brennan, just finally, looking back at your time building and running this infrastructure, what's something that turned out to be way harder operationally than you expected? And what's something that actually worked out better than anticipated?

BRENNAN HILDEBRAND: I think both of these are kind of orchestration, really. For us, because we've got so many instances running out there, it's a challenge to basically keep everything maintained, up and running, easy to maintain, especially, and easy to expand—all of those things.

When we really started pushing a lot of growth, a lot of expansion of our agents, we were looking at Kubernetes. We wanted to be very modern like everybody else and use that. But it turns out it's quite difficult. I mean, to some extent, we're an anti-pattern: we require fixed IPs for our endpoints. Kubernetes is kind of anonymous. We need to have all the non-privileged ports open to the Internet. That's a problem with Kubernetes. It doesn't do it. Some of the networking packages don't do it at all. There are a couple that will, but it's very tedious and time consuming to do, and very hard to maintain.

So, the thing that really surprised us, that worked well, is more classic orchestration methods—sort of the way we used to do it before Kubernetes. That pattern and paradigm actually work well for us. Of course, we're still looking at Kubernetes. We still pay attention to it and see if there's a way we can get in there, because it does simplify a lot of things if configuration and management of those configs is easy, which it hasn't been so far in this case.

MIKE HICKS: I think that's a great lesson for everything overall really, it is that we're not just pushing stuff for the sake of technology. When I talk about complexity, it isn't complexity for complexity’s sake when we're building these applications, it's so you have better service. And this is the same essentially, or what I'm hearing is, it’s the same when you actually sort of build and deploy this: “We looked at a way to improve the orchestration, but what we were doing actually was the most efficient, most beneficial to us to actually do there. Therefore, we don’t necessarily need to try and crop out another dependency or complication on top of that.”

This is a lesson again, even for the enterprises looking at how they're going to use their applications, it’s: Look at these end-to-end services, look at how we do that and what's the best way to actually achieve that. We're effectively doing that in the same way when we talk about our cloud agents.

BARRY COLLINS: That’s our show. Please give us a follow and leave us a review on your favorite podcast platform. We really appreciate it and not only does this help ensure you're in the know when a new episode’s published but also helps us to shape the show for you. You can follow us on LinkedIn or X @ThousandEyes or send questions or feedback to internetreport@thousandeyes.com. Until next time, goodbye!
