What Deep Space Operations Can Teach Us About Agentic AI

This is The Internet Report, where we analyze outages and trends across the Internet through the lens of Cisco ThousandEyes Internet and Cloud Intelligence. This week, we’re widening our perspective on connectivity—especially intermittent connectivity, and the lessons that agentic AI systems can take away from deep space operations. As always, you can read more below or tune in to the podcast for firsthand commentary.

Beyond Connectivity: Building Resilient Agentic AI

As agentic AI moves to the edge, IT professionals face a critical blind spot: the gap between connectivity and data currency. Systems may stay connected while acting on stale data and cause autonomous errors.

In this episode, we explore the architectural challenges of deploying agentic AI in environments with limited or intermittent connectivity. Drawing parallels to the extreme constraints that engineers contend with in outer space operations, we discuss:

Decoupling connectivity from data currency: Find out why a stable network doesn’t necessarily mean its data is fresh, and how monitoring both as distinct failure modes can prevent autonomous systems from acting on outdated operational states.
Building self-awareness into AI architecture: Understand how prioritization of active verification in system design helps ensure that before executing decisions, the system checks context validity, rather than relying on passive updates.
Defining graceful degradation: Hear how a fleet-wide autonomous vehicle incident illustrates the value of establishing clear operational modes before deployment, so the system knows how to fail safely when it can no longer verify its own context.

To learn more, listen now and follow along with the full transcript below.

A Conversation on Deep Space and Agentic AI

BARRY COLLINS: Hi everyone! Welcome back to The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. In this episode we’re talking about the lessons that agentic AI can learn from deep space, with a particular focus on how to deal with intermittent connectivity. I’m Barry Collins and I’ll be hosting today with the amazing Mike Hicks, Principal Solutions Analyst at Cisco ThousandEyes. As always, we’ve included chapters in the episode description so you can skip ahead to the sections that are most interesting to you. And if you haven’t already, we’d love you to take a moment to give us a follow over at Spotify, Apple Podcasts, or wherever you like to listen.

When space engineers are designing vehicles such as the Mars Rover, they have to deal with hugely limited resources, not least that the control plane is 20 minutes away at the speed of light—meaning the vehicle has to make good decisions autonomously.

There are some lessons to be learned here for the agentic AI industry, aren't there Mike?

MIKE HICKS: Yeah, and I think this is one of those comparisons that sounds surprising until you actually sit down with it—and then you can't unsee it, once you've gone through this.

So just to be precise on the numbers there, because I think they're worth getting right: The one-way delay between Earth and Mars actually ranges from about three minutes at closest approach to up to 22 minutes when the planets are on opposite sides of the sun. So that means that a round-trip time can be anywhere from six to 44 minutes depending on the orbital position. But either way, the point stands: real-time control is simply not an option.
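A minimal sketch of the arithmetic behind those figures. The two distances are assumed round numbers for the closest and farthest Earth-Mars separation, not mission data, and the function names are illustrative.

```python
# Light-time delay between Earth and Mars.
# The distances below are approximate, assumed values for illustration.
SPEED_OF_LIGHT_KM_S = 299_792.458
CLOSEST_KM = 54.6e6     # roughly the minimum Earth-Mars separation
FARTHEST_KM = 401e6     # roughly the maximum, planets on opposite sides of the sun


def one_way_delay_minutes(distance_km: float) -> float:
    """Light-travel time in minutes over the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_S / 60


def round_trip_minutes(distance_km: float) -> float:
    """A command and its acknowledgement cross the gap twice."""
    return 2 * one_way_delay_minutes(distance_km)


for label, km in [("closest", CLOSEST_KM), ("farthest", FARTHEST_KM)]:
    print(f"{label}: one-way {one_way_delay_minutes(km):.1f} min, "
          f"round trip {round_trip_minutes(km):.1f} min")
```

With these assumed distances the one-way delay comes out at roughly 3 and 22 minutes, which matches the range quoted above.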

So that's not just a communication inconvenience. It's actually a hard architectural constraint. The rover can't phone home and wait for an answer. It has to make good decisions with the information it has and in the context it's operating in, right now. So that's not a nice-to-have capability. It's the only option.

And what's sort of interesting is that we've largely been able to avoid that architectural discipline because until recently, the two things that force it, limited connectivity and autonomous actions, hadn't actually arrived together. We had abundant connectivity so we could tolerate assistive AI that sort of phones home constantly. And where we did have limited connectivity, we didn't have systems making consequential autonomous decisions.

Edge AI changes both those simultaneously, so you're reducing connectivity and increasing the stakes of acting on stale context at exactly the same moment. Now, space engineers have been living with that intersection for decades. Enterprise AI is arriving there now.

I want to be honest about something here, because I think it's actually the more interesting angle. Deep space hasn't actually solved this yet (or hasn't fully solved this). The thinking is very mature and the problem is well understood, but the implementation story is still evolving. What space operations give us isn't a finished solution to copy, but a way of making the problem impossible to ignore. The stakes are so unforgiving that every assumption gets questioned, and that discipline is exactly what we need from the agentic AI industry right now.

BARRY COLLINS: Give us some scenarios where agentic AI agents must cope with limited or intermittent connectivity.

MIKE HICKS: Where this really matters is primarily in what we're going to call operational technology environments, also referred to as OT. And this covers mining, energy grids, water treatment, remote manufacturing, heavy industrial sites. Now the reason these environments behave differently to a standard enterprise network is rooted in the way they were originally designed. These systems, the SCADA platforms, distributed control systems, and programmable logic controllers that actually run the physical process, were architected decades ago around a principle called the Purdue model. Now that model creates a deliberate hierarchy of network layers with increasingly strict controls the closer you get to that physical process. The underlying assumption was air-gap security, with isolation as the primary defensive mechanism. So these systems were never designed with the assumption of continuous external connectivity.

Now what this means in practice is the connectivity in these environments is often intentionally constrained, and for obvious reasons. So they have data diodes that physically enforce one-way communications where data can flow out, but nothing comes in. And this eliminates sort of a whole class of attack vectors. But it also means that you can't push updates in real time.

There are also restricted proxies and jump boxes that control what services can be reached and when. And in some segments, synchronization with external systems sort of happens on a scheduled basis through a control window. But all of this is controlled rather than continuous.

The IT/OT convergence is changing some of this. So we're seeing industrial protocols like OPC UA and MQTT enabling more connectivity across the layers, and edge-to-cloud architectures are increasingly common, but security architecture constraints haven't gone away. Obviously, they shouldn't, because they were there for a reason. But the result of that, then, is that the autonomous systems’ contextual data (so things like blast schedules at a mine, safety exclusion zones at an energy facility, routing priorities in a logistics operation) lives in what's called a historian. Now this is a local data store that captures the operational data and gets synchronized via control windows, rather than receiving a live feed.

And this is where that failure mode becomes interesting. The network connection can be perfectly healthy. The historian is accessible, all green on the dashboard. But the machine is making decisions against data that reflects an operational state from three hours ago, because that's when the last sync window closed. And there's no fault anywhere in the system. The pipe is open. The historian is responding. The data just isn't current.

So standard network monitoring has no real mechanism to catch that distinction because it's looking at connectivity, not currency, and that's essentially where the gap is.
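The distinction can be sketched as two checks that are evaluated separately. This is an illustrative sketch only: `HistorianStatus`, `assess`, and the 30-minute age limit in the usage note are hypothetical names and values, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class HistorianStatus:
    reachable: bool          # did the historian answer a probe?
    last_sync_epoch: float   # when the last sync window closed, in seconds


def connectivity_ok(status: HistorianStatus) -> bool:
    """Connectivity dimension: can we reach the data store at all?"""
    return status.reachable


def currency_ok(status: HistorianStatus, now: float, max_age_s: float) -> bool:
    """Currency dimension: is the last synced state recent enough to act on?"""
    return (now - status.last_sync_epoch) <= max_age_s


def assess(status: HistorianStatus, now: float, max_age_s: float) -> str:
    """Report the two dimensions together without collapsing them into one flag."""
    if not connectivity_ok(status):
        return "disconnected"
    if not currency_ok(status, now, max_age_s):
        return "connected-but-stale"   # the state a connectivity-only dashboard shows as green
    return "healthy"
```

For the case described above, a reachable historian whose last sync closed three hours ago, `assess(HistorianStatus(True, 0.0), now=3 * 3600, max_age_s=1800)` returns `"connected-but-stale"`. An age check like this only catches staleness that shows up in the timestamp.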

And this is where it gets genuinely interesting from an operational perspective, because the driver for agentic operations in these environments isn't just about automating what humans were doing already, it's about incorporating these contextual data sources that were previously too complex to act on in real time. Think of things like weather pattern data that's going to affect equipment performance and haul routes; power grid consumption data that influences when processing operations can run; environmental sensor feeds that affect safety exclusion boundaries; the commodity pricing signals that can influence extraction priorities. Now, none of these were easy to fold into operational decisions when a human had to synthesize them manually. Agentic systems could potentially integrate all of them, and the efficiency and profitability case for doing so is actually quite compelling.

But here's the tension that doesn't get talked about enough. Every additional data source that you bring in has its own freshness characteristics. It has its own sync cadence. It has its own latency profile. Weather data might update every 15 minutes. Power grid pricing might update every five. Equipment telemetry might be near real time. And the blast schedule that we mentioned earlier might be on a two-hour sync window. The system is now making decisions dependent on all of these sources simultaneously. And each one of these has a different acceptable age and a different failure mode. That's not a connectivity problem that better protocols can solve, that's actually a data currency problem that has to be designed for explicitly.
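One way to picture a decision that depends on several sources at once: compare each source's age with its own budget and look at the worst ratio. This is a sketch under stated assumptions. The acceptable ages simply reuse the update cadences mentioned above (with five seconds standing in for "near real time"), and the observed ages are invented for the example.

```python
from dataclasses import dataclass

MINUTE = 60


@dataclass(frozen=True)
class Source:
    name: str
    max_age_s: float   # acceptable age for this source (assumed policy)
    age_s: float       # observed age of the latest value (made-up numbers)


def freshness_ratio(src: Source) -> float:
    """Age as a fraction of the source's own budget; above 1.0 means stale."""
    return src.age_s / src.max_age_s


def within_budget(src: Source) -> bool:
    return freshness_ratio(src) <= 1.0


def stalest(sources: list[Source]) -> Source:
    """The source closest to, or furthest past, its own limit."""
    return max(sources, key=freshness_ratio)


sources = [
    Source("weather", 15 * MINUTE, 10 * MINUTE),
    Source("grid_pricing", 5 * MINUTE, 4 * MINUTE),
    Source("equipment_telemetry", 5, 2),
    Source("blast_schedule", 120 * MINUTE, 150 * MINUTE),
]

worst = stalest(sources)
print(worst.name, round(freshness_ratio(worst), 2))
print([s.name for s in sources if not within_budget(s)])
```

Normalizing by each source's own limit matters here: the blast schedule is the oldest value in absolute terms and also the only one past its budget, but telemetry that was 10 seconds old would be just as far out of bounds despite being far "newer".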

BARRY COLLINS: There are already protocols in place to handle intermittent connectivity in those extreme environments. Tell us about those.

MIKE HICKS: The textbook answer here is delay-tolerant networking (DTN) and the Bundle Protocol. These were specifically designed for environments where you can't assume end-to-end connectivity: store and forward, custody transfer between nodes, no expectation that a path exists at any given moment. And actually, that thinking is sound, and it was built for exactly the kind of problem we're describing.
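The store-and-forward idea can be shown with a toy model. This is not an implementation of the Bundle Protocol and uses no DTN library; `StoreAndForwardNode` and the node names are hypothetical, and real systems add expiry, acknowledgement, and security that are omitted here.

```python
from collections import deque


class StoreAndForwardNode:
    """Toy node that holds data until its next hop becomes reachable."""

    def __init__(self, name: str):
        self.name = name
        self.stored = deque()   # items this node is currently holding

    def accept(self, bundle: str) -> None:
        self.stored.append(bundle)

    def forward(self, next_hop: "StoreAndForwardNode", link_up: bool) -> int:
        """Hand everything to next_hop if the link is up; otherwise keep storing."""
        moved = 0
        while link_up and self.stored:
            next_hop.accept(self.stored.popleft())
            moved += 1
        return moved


rover = StoreAndForwardNode("rover")
orbiter = StoreAndForwardNode("orbiter")
earth = StoreAndForwardNode("earth")

rover.accept("image-1")
rover.accept("image-2")

rover.forward(orbiter, link_up=True)    # rover can see the orbiter
orbiter.forward(earth, link_up=False)   # orbiter cannot see Earth yet, so it holds the data
orbiter.forward(earth, link_up=True)    # later pass: the data moves on

print(list(earth.stored))
```

At no point in this run does a complete rover-to-Earth path exist, yet both items arrive in order, which is the property the paragraph above describes.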

But the Bundle Protocol version 7 was only finalized as an RFC in January 2022. The implementation ecosystem is still relatively thin and if you adopt it today, you're actually going to have to build your own security, congestion control, and reliability on top of that. So you're constantly reinventing the transport stack rather than solving the actual problem you came to solve.

Considering we're thinking about this from an AI perspective, we're trying to do this in an industry that's moving at a significant rate. Now, in my opinion, QUIC is actually the more interesting practical answer right now. I think it's really underappreciated specifically in this context.

And just before I get into the why, it's actually worth saying that QUIC isn't an acronym in a traditional sense. It actually started off its life at Google, as a shorthand for Quick UDP Internet Connections. But when the IETF standardised it, they just dropped the acronym completely, so it's just QUIC. It's a name in its own right, like Pelé, Madonna, or Cher. One name is enough to define it.

BARRY COLLINS: And I think it’s worth pointing out to the listener that it’s spelt QUIC, there’s no K on the end.

MIKE HICKS: Now the obvious question is: can't TCP do this? And the honest answer is that TCP has improved significantly. It's been around for a number of years, it's improved for many scenarios, and it's perfectly adequate in many conditions. But we're talking about three specific places where QUIC has a meaningful advantage in exactly the conditions that we're describing here.

So the first is what's called head-of-line blocking. Now in TCP, if a packet gets lost, the whole connection stalls until that packet is retransmitted and received. Everything queues up behind it. QUIC uses stream multiplexing, which means that a lost packet only affects that particular stream. Everything else keeps moving. In an environment where we're dealing with variable link quality and intermittent connectivity, that actually matters a lot.
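A small simulation makes the difference visible. It is a simplified model, not real TCP or QUIC: packets are tagged with a stream, one packet is "lost" and arrives late, and the only thing that changes between the two runs is whether ordering is enforced across the whole connection or per stream. All names and numbers are illustrative.

```python
def delivery_times(packets, arrival, per_stream_ordering):
    """Return seq -> time at which the application can read that packet.

    packets: list of (seq, stream_id) in send order.
    arrival: seq -> time the packet actually arrives at the receiver.
    A packet is readable only after every earlier packet in the same
    ordering domain (whole connection, or just its stream) is readable.
    """
    ready_at = {}
    last_ready = {}
    for seq, stream in packets:
        key = stream if per_stream_ordering else "connection"
        ready = max(arrival[seq], last_ready.get(key, 0))
        ready_at[seq] = ready
        last_ready[key] = ready
    return ready_at


# Six packets interleaved across three streams; packet 1 is lost and
# only arrives at t=10 after retransmission.
packets = [(0, "A"), (1, "B"), (2, "C"), (3, "A"), (4, "B"), (5, "C")]
arrival = {seq: seq + 1 for seq, _ in packets}
arrival[1] = 10

tcp_like = delivery_times(packets, arrival, per_stream_ordering=False)
quic_like = delivery_times(packets, arrival, per_stream_ordering=True)


def stalled(ready_at):
    """Packets that arrived on time but had to wait behind the lost one."""
    return [seq for seq in ready_at if ready_at[seq] > arrival[seq]]


print("connection-wide ordering stalls:", stalled(tcp_like))
print("per-stream ordering stalls:", stalled(quic_like))
```

With connection-wide ordering, every packet sent after the lost one waits for the retransmission. With per-stream ordering, only the later packet on stream B waits.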

The second is connection reestablishment. When a TCP connection drops and comes back, you need a full handshake to reestablish it. This has latency, so in a constrained environment where the link might drop and recover frequently, that overhead accumulates. And QUIC’s 0-RTT [Zero Round Trip Time] resumption means that when a link recovers, you're back up in milliseconds rather than going through this whole process again from scratch, with this sawtooth of handshakes as you go.
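How that overhead accumulates can be estimated with simple arithmetic. The round-trip counts below are the usual textbook figures (a TCP handshake followed by a full TLS 1.3 handshake, versus a resumed QUIC connection) and ignore options such as TCP Fast Open. The 600 ms round-trip time and 40 reconnects are assumed values for illustration.

```python
# Round trips spent on connection setup before the first application
# data can be sent. Typical figures, not measurements.
SETUP_RTTS = {
    "tcp + tls 1.3 (full handshake)": 2,   # 1 for TCP, 1 for TLS 1.3
    "quic (first connection)": 1,
    "quic (0-RTT resumption)": 0,
}


def reconnect_overhead_s(rtt_s: float, setup_rtts: int, reconnects: int) -> float:
    """Total time spent on handshakes across repeated link recoveries."""
    return rtt_s * setup_rtts * reconnects


RTT_S = 0.6        # assumed round-trip time on a constrained link
RECONNECTS = 40    # assumed number of link drops over a shift

for name, rtts in SETUP_RTTS.items():
    print(f"{name}: {reconnect_overhead_s(RTT_S, rtts, RECONNECTS):.1f} s")
```

One design caveat worth keeping in mind: data sent in a 0-RTT flight can be replayed by an attacker, so it is normally limited to requests that are safe to repeat.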

Now the third is security. TCP has no built-in encryption, so you're adding in TLS separately, which again is another layer to design, implement, and maintain. QUIC has TLS 1.3 built in from the ground up, so in a critical infrastructure environment, you're not bolting security on as an afterthought.

Put those things together and you get a protocol that recovers fast, keeps moving under packet loss, and comes with security already integrated. It's battle tested at large scale by the biggest names on the internet and it's actually deployable today. We're not waiting for a standards body to catch up and we're not reinventing the wheel. So, we can then focus on the AI logic rather than the transport plumbing.

And that said, there's an important caveat. QUIC solves the connectivity dimension of the problem. It doesn't solve the data currency dimension. And that's the harder one.

BARRY COLLINS: Okay, so let's focus in on one of those areas where connectivity is limited, sometimes intentionally as you said. Tell us about the autonomous haul truck.

MIKE HICKS: This is a really good example because it makes the stakes concrete. Now we're talking about a piece of equipment that can cost somewhere in the region of 5 million dollars. It's operating in an environment where a single operational error can be catastrophic from both a safety and a financial perspective.

Stopping one of those trucks unexpectedly can cost in the region of $50,000 or more in lost production per hour. So there's obviously a lot of pressure to keep the machine running, which is exactly what makes the data freshness problem so dangerous. Now, the trucks’ operational decisions depend on contextual data, so: Where are the blast exclusion zones right now? What are the current routing priorities? Are there safety boundaries that have changed in the last hour?

And that data lives, as we said, in the local historian or an edge cache and is updated via a scheduled sync rather than a live feed, because of everything we talked about from an OT perspective.

Now here's where those two dimensions I mentioned earlier become critical. You have to monitor connectivity and data currency independently because they can fail independently. The network connection can be healthy. The sync can have completed successfully two hours ago, but a blast schedule update came through 30 minutes ago and didn't make it into the local cache because the sync window had already closed.

The machine has no idea of this. So from its perspective, everything is current, but it's making decisions based on a reality that no longer exists. And that's the failure mode that doesn't show up in a plain monitoring dashboard. No fault, no alert, no red light, just the machine operating with quiet confidence but on stale data. And that's effectively more dangerous than a machine that knows it's lost connectivity, because at least lost connectivity triggers a response, whereas stale data with a healthy connection is invisible until something actually goes wrong.

BARRY COLLINS: So as you explained earlier, this isn't just a niche problem for extreme environments. There is something here that applies to agentic AI more broadly.

MIKE HICKS: Yeah, it does feel niche right now because most agentic AI deployments we think of are in well-connected enterprises where the connectivity is generally abundant and reliable. And we're not necessarily ultimately thinking about this sort of data freshness. But any agentic system making an autonomous decision is implicitly trusting that its context is current.

Now most systems don't question that assumption. They don't have a mechanism for asking whether the data they're acting on is still valid. And that's fine whenever there's a human in the loop, reviewing a recommendation, but it's not fine when that system is starting to act autonomously, because the error doesn't get caught before it becomes a decision.

As we said, that staleness detection problem is actually harder than the connectivity problem. And that's the point I really want to land here. You can monitor whether the connection is up. That's a solved, known problem. We've done that. Knowing whether the data that arrived over that connection is still accurate and reflects the current state of the world, now that is a different and genuinely harder question.

And it's where most current architectures actually have a blind spot, including those ones that aren't in extreme environments at all.

And I want to be precise about what we mean by freshness here because it's richer than just knowing “how old is that data?” There are really three things you need to know. First, is the source still active? So, is there actually something on the other end that is live and authoritative? Second, is the data you receive complete? Did the task of getting the data actually finish? Or did you just get a partial picture? And third, can you actually use what you have as valid context for the decision you're about to make?

Now the reason you can't always answer those questions by checking a cache or looking for a freshness flag is that those mechanisms only tell you when the data was last pushed, not whether the underlying reality has changed since then. Back to the mining example: a blast schedule might show as updated 30 minutes ago, but if the sync completed just before a change was authorized, that timestamp's actually misleading. The only way to be confident is to make an active call to verify the source is still alive, verify the response is consistent with what you expect, and only then treat that data as usable context.

So if that call fails or returns something unexpected, you have options. You can try to validate from another source, you can fall back to a known safe state, or you can escalate to a human, but you need that active verification step built into the architecture, not just assumed away.
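The verification step and its fallbacks can be sketched as follows. This is a hedged illustration rather than a reference design: `Reading`, `Action`, and `verify_context` are hypothetical names, and the fetch callables stand in for whatever active call a real deployment would make to its sources.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    USE_CONTEXT = "use_context"
    SAFE_STATE = "fall_back_to_safe_state"
    ESCALATE = "escalate_to_human"


@dataclass
class Reading:
    source_alive: bool   # did a live, authoritative source answer the call?
    complete: bool       # did the transfer finish, or is this a partial picture?
    consistent: bool     # does the response match what we expect?


def is_usable(reading: Reading) -> bool:
    """All three freshness questions must be answered 'yes'."""
    return reading.source_alive and reading.complete and reading.consistent


def verify_context(
    primary: Callable[[], Reading],
    secondary: Optional[Callable[[], Reading]] = None,
    safe_state_available: bool = True,
) -> Action:
    """Actively verify context before acting; otherwise degrade deliberately."""
    for fetch in (primary, secondary):
        if fetch is None:
            continue
        try:
            reading = fetch()            # the active call, not a cache lookup
        except (ConnectionError, TimeoutError):
            continue                     # try the next source, if any
        if is_usable(reading):
            return Action.USE_CONTEXT
    # Nothing could be verified: prefer a known safe state, else ask a human.
    return Action.SAFE_STATE if safe_state_available else Action.ESCALATE
```

The ordering of the fallbacks (second source, then safe state, then human) is an assumption made for the example. The source only lists them as options.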

The design principle worth borrowing, then, from space operations is this: Systems that know what they don't know are more trustworthy than the systems that are confidently wrong. Building that self-awareness into an agentic system, the ability to assess confidence in its own context before acting, is a first-class design requirement, not an afterthought, not a monitoring layer you add later, but really a foundational design requirement.

BARRY COLLINS: If you're designing an agentic system that must operate reliably in a constrained or intermittently connected environment, what does getting this right actually look like?

MIKE HICKS: So, the first thing, and this sounds obvious, but it is frequently skipped, is you need to monitor connectivity and data currency as two separate, independent failure modes, with separate streams, effectively. Typically, the network team would own that connectivity dimension. They also now need to have an overview, specifically on that separate stream, of what's happening with the data currency. So even though you're monitoring those two separately, you need to observe them together, as a single entity.

And second, define the staleness thresholds per data type before you actually deploy, not after your first incident. I touched on this earlier, but it's worth being specific because people often treat this as a single policy decision when really it needs to be done at the data-type level. A geological survey might be valid for days; routing priority data might be valid for an hour; a blast schedule might be valid for 30 minutes; safety exclusion data, though, has no acceptable staleness at all. And importantly, those thresholds need to inform the active verification cadence. How often am I making the call to verify the source is live and current? And that should be a direct function of how quickly that data type ages.
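A sketch of how per-type thresholds might drive the verification cadence. The thresholds follow the examples in the paragraph above ("days" is taken as three days). The rule of checking four times per validity window and the one-second floor are assumptions chosen for illustration.

```python
MINUTE, HOUR, DAY = 60, 3600, 86400

# Maximum acceptable age per data type, in seconds.
MAX_AGE_S = {
    "geological_survey": 3 * DAY,     # "valid for days"; 3 is an assumed value
    "routing_priority": 1 * HOUR,
    "blast_schedule": 30 * MINUTE,
    "safety_exclusion": 0,            # no acceptable staleness at all
}


def verify_interval_s(max_age_s: float, checks_per_window: int = 4,
                      floor_s: float = 1.0) -> float:
    """How often to actively verify a source, as a function of how fast it ages.

    A threshold of zero returns 0.0, meaning: verify before every decision.
    """
    if max_age_s == 0:
        return 0.0
    return max(floor_s, max_age_s / checks_per_window)


for data_type, max_age in MAX_AGE_S.items():
    print(f"{data_type}: verify every {verify_interval_s(max_age):.0f} s")
```

Deriving the cadence from the threshold, rather than setting it separately, keeps the two from drifting apart when a policy changes.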

Third, formalize that degradation path we described, so: Full autonomous operation when data currency is high and connectivity is healthy. Cautious autonomous operation when currency is degraded but within acceptable bounds. Human-in-the-loop when confidence drops below a defined threshold. Safe stop when the system can't verify its context at all. The key word here is formalized.

This isn't something you want to be working out in the moment when something goes wrong. It sort of needs to be defined, tested, and understood by the operators before that system goes live. And the organizations that get this right are the ones where operators can look at the system state and immediately know which mode it's in and why, not the ones that find out what degradation behavior looks like after their first incident.
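The four modes can be written down as an explicit selection function, which is one way to make them testable before go-live. This is a sketch: the confidence score (0 to 1) and both threshold values are hypothetical, and a real system would define them per deployment.

```python
from enum import Enum


class Mode(Enum):
    FULL_AUTONOMY = "full_autonomy"
    CAUTIOUS_AUTONOMY = "cautious_autonomy"
    HUMAN_IN_THE_LOOP = "human_in_the_loop"
    SAFE_STOP = "safe_stop"


def select_mode(context_verified: bool, connectivity_healthy: bool,
                confidence: float, cautious_below: float = 0.9,
                human_below: float = 0.6) -> Mode:
    """Map the system's view of its own context to one operating mode.

    confidence is an assumed 0-1 score for data currency; the thresholds
    are placeholder values, not recommendations.
    """
    if not context_verified:
        return Mode.SAFE_STOP                  # cannot verify context at all
    if confidence < human_below:
        return Mode.HUMAN_IN_THE_LOOP          # below the defined threshold
    if confidence < cautious_below or not connectivity_healthy:
        return Mode.CAUTIOUS_AUTONOMY          # degraded but within bounds
    return Mode.FULL_AUTONOMY                  # currency high, connectivity healthy


print(select_mode(True, True, 0.95))
print(select_mode(True, True, 0.75))
print(select_mode(True, True, 0.40))
print(select_mode(False, True, 0.95))
```

Because the function is pure, operators can review a table of inputs and resulting modes, which supports the goal of knowing which mode the system is in and why.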

BARRY COLLINS: Are there learning points here for the regular autonomous vehicles that are increasingly appearing on our roads? Surely, they must have to fail gracefully if connectivity is suddenly lost while they're driving down a highway or navigating heavy traffic.

MIKE HICKS: Yeah, and this is really a great example because it's something people can picture immediately. The autonomous vehicle case is interesting because it actually handles the connectivity dimension really well already.

Most modern autonomous vehicles are designed to operate primarily on what's called local sensor fusion. So, that means combining data from multiple onboard sensors in real time—things like LiDAR, which uses laser pulses to build a precise 3D picture of the surrounding environment, cameras for visual context, and radar for detecting speed and distance. The key point is that these are all on board and local. The vehicle isn't depending on a live cloud connection for moment-to-moment decisions about what's in front of it. The architecture explicitly accounts for the possibility the connection is going to drop because it has to, because we can't be left in this state.

But the data currency dimension is actually more nuanced. The vehicle might be operating perfectly well on local sensors, but its map data (its understanding of road layouts, construction zones, updated traffic patterns) is only as good as the last time it was synced with that external source. So if something significant has changed since the last update, the vehicle doesn't know what it doesn't know. It's operating with confidence on a model of the world that may no longer be accurate.

Now Waymo is worth mentioning here because they're probably the most mature autonomous vehicle deployment that we have at scale right now. And there was a really instructive incident in December, 2025 that I think really illustrates this point perfectly well.

There was a fire at a power substation in San Francisco. It knocked out the electricity to a third of the city, taking down traffic lights across large areas simultaneously. And when these traffic lights failed, they didn't have battery backup, so they went dark. Now, Waymo's vehicles are designed to sort of handle these dark traffic signals correctly. They treat them as four-way stops, which is what the highway code says. But in this case, they didn't handle the majority of them correctly because there was somewhere in the region of 7,000 dark intersections that day.

The failure was something more subtle and really quite interesting. Waymo’s system allows vehicles to request human confirmation checks from the remote operations teams if they're uncertain about a situation. So that's that data validity check we're talking about. With all these dark intersections appearing simultaneously across the fleet—I think it was something in the region of 4,000 cars—there was a concentrated spike in those confirmation requests, that sort of overwhelmed the service desk.

So, the vehicles that actually requested confirmation didn't get anything back because of this spike and they just sort of sat there. They didn't have the context on which to act, so they assumed it was a catastrophic decision and their failure mode in that case was to stop dead. So then, they had all these vehicles blocking normal traffic trying to get through that could see visually what was going on there. What that means in our terms is: That graceful degradation mechanism, the human-in-the-loop layer, wasn't designed to scale to a simultaneous, fleet-wide event and it actually didn't have the ability to understand that individual behavior.

So if we had this situation, and we had the human loop in there, what we could have accounted for was this understanding that: Can we actually get to the control center? Yes, we can get to it. Let's do an API call back to there. We can actually get to that. So therefore, we verified the context is there, so it must be something else. At that point, maybe go and seek some other information, maybe the power utility to understand, ah there is a failure in there. And then they could have just proceeded with their onboard sensors to understand “we can actually just proceed with caution” at that point, as opposed to coming to a stop.
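The reasoning described above can be written as a small decision function. This is purely illustrative and is not Waymo's logic: the inputs, the function name, and the outcomes are hypothetical, and a real vehicle would weigh far more than four signals.

```python
from typing import Optional


def decide_at_dark_signal(confirmation: Optional[bool],
                          control_center_reachable: bool,
                          outage_reported: bool,
                          sensors_healthy: bool) -> str:
    """Choose an action when a human confirmation was requested.

    confirmation is True or False if an operator answered, or None if
    no answer arrived within the wait limit.
    """
    if confirmation is True:
        return "proceed"
    if confirmation is False:
        return "stop"
    # No answer. Work out why before assuming the worst.
    if (control_center_reachable      # the link itself is fine
            and outage_reported       # a second source explains the dark signals
            and sensors_healthy):     # local perception can carry the maneuver
        return "proceed_with_caution"
    return "stop"


# Operators are overloaded, but the link is up and the utility reports an outage.
print(decide_at_dark_signal(None, True, True, True))
# Nothing can be verified, so the conservative default still applies.
print(decide_at_dark_signal(None, False, False, True))
```

The point of the sketch is the middle branch: an unanswered request is treated as a signal to gather more context, with stopping kept as the default when that context cannot be verified.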

BARRY COLLINS: What can we learn from this, Mike?

MIKE HICKS: I think the broader point here is that this is really about where these principles sit in the design process. Data freshness, disconnect tolerance, graceful degradation: these aren't features you add to an agentic system. They're foundational constraints, closer to guardrails that have to be established before any other architectural decisions.

So, if you treat them as foundational, you're going to end up with a system that's structurally capable of operating reliably in the real world. If you treat them as things to address later, you end up with something that works in the lab but then fails in production, and in ways that are genuinely hard to fix without starting over. And that's not a retrofit problem, it's an actual architectural one. I think it needs to be designed in from that foundational start as we move forward.

BARRY COLLINS: That’s our show. Please give us a follow and leave us a review on your favorite podcast platform. We really appreciate it and not only does this help ensure you're in the know when a new episode’s published but also helps us to shape the show for you. You can follow us on LinkedIn or X @ThousandEyes or send questions or feedback to internetreport@thousandeyes.com. Until next time, goodbye!
