This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re taking a break from our usual programming for a conversation about steps organizations can take to assure performance as they integrate AI more deeply into their IT environments and adapt to the ever-changing AI landscape. As always, you can read more below or tune in to the podcast for firsthand commentary.
Assuring Performance on Your AI Journey
As organizations invest more heavily in artificial intelligence, many are discovering that their existing IT infrastructure isn't quite ready for the advanced, resource-intensive workloads AI requires. To capitalize on AI’s potential, not only must teams evolve their IT environment to support its unique needs, they must also be prepared to assure the performance of this increasingly complex web of services and dependencies.
In this episode of The Internet Report podcast, we explore the challenges of delivering quality digital experiences as AI becomes more integrated into IT environments, covering key digital resilience strategies and considerations for IT operations teams.
We'll discuss:
- The Resource Demands of AI: AI workloads are significantly more resource-hungry and dynamic than traditional applications. Think of it like cooking—you might confidently prepare dinner for six friends, but scaling that same recipe to serve 600 people requires industrial kitchens, professional equipment, and completely different processes. Similarly, organizations that successfully ran AI pilots are discovering that production-scale deployment demands scalable infrastructure architecture across hybrid environments that might include on-prem data centers, private clouds, public clouds, and more. Data sovereignty and cross-border data transfer regulations add another layer of complexity.
- The Shift to Dynamic Distributed Architectures and Edge Computing: To minimize latency and meet regulatory needs, AI benefits from being closer to data sources and users, driving the adoption of dynamic distributed architectures and edge computing that can adapt in real time to changing workload demands, data patterns, and computational requirements.
- Evolving Digital Resilience Needs: AI introduces new complexities, with unpredictable and dynamic performance patterns that can shift based on data inputs, model decisions, and real-time interactions. Unlike traditional applications with predictable resource usage, AI systems might suddenly consume vast computing power in one area while remaining idle in another, and these patterns can change from hour to hour. This dynamic behavior, combined with increased dependence on third-party services, creates a web of shifting dependencies. Additionally, AI's heavy reliance on the network means mere connectivity isn’t enough; quality performance matters more than ever. Delays or data loss due to network issues can have significant consequences, making quality service and low latency crucial.
- The Move From “Break-Fix” to Proactive Optimization: In this new reality, even subtle performance degradations can have big impacts, because of AI's dynamic and adaptive nature. Unlike traditional applications with predictable failure patterns, AI systems can exhibit performance issues in unexpected ways, and a single failure can simultaneously cascade across multiple, intertwined business processes. This makes it critical to catch issues early. ITOps teams need to shift from a reactive "break-fix" model to a proactive, predictive approach to maintain optimal digital experiences.
- Why AI Is a Journey, Not a One-time Upgrade: AI readiness is a continuous journey, not a one-time upgrade. Organizations must design flexible infrastructure that can adapt as needs continue to evolve. Scale your AI usage as it makes sense for your organization, allowing AI and traditional workloads to coexist rather than trying to replace everything at once.
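The break-fix-to-proactive shift described above can be made concrete with a small, hypothetical sketch: instead of alerting only on hard failures, flag measurements that deviate sharply from a rolling baseline. All function names, thresholds, and sample values here are illustrative, not any particular monitoring product's logic.

```python
# Minimal sketch (hypothetical values): flag gradual latency degradation
# before it becomes an outage, rather than waiting for a hard failure.
from statistics import mean, stdev

def degradation_alerts(latencies_ms, window=5, threshold_sigma=2.0):
    """Return indices where latency deviates sharply from the rolling baseline."""
    alerts = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Alert when the new sample sits far above the recent baseline.
        if sigma > 0 and (latencies_ms[i] - mu) / sigma > threshold_sigma:
            alerts.append(i)
    return alerts

# A sudden jump trips an alert well before a hard timeout would.
samples = [20, 21, 19, 20, 22, 21, 20, 45, 21, 20]
print(degradation_alerts(samples))  # → [7]
```

The point is the posture, not the math: the proactive model watches for drift and spikes in performance data, while break-fix only reacts once something has already stopped responding.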
To learn more, listen now and follow along with the full transcript below.
A Conversation on AI, Digital Resilience, and Evolving ITOps Best Practices
BARRY COLLINS: Hi, everyone. Welcome back to The Internet Report, where we uncover what's working and what's breaking on the Internet and why.
This week, we're talking about steps organizations can take to assure performance as they integrate AI more deeply into their IT environments and adapt to the ever-changing AI landscape.
I'm Barry Collins, and I'll be hosting today with the amazing Mike Hicks, Principal Solutions Analyst at Cisco ThousandEyes. As always, we've included chapters in the episode description below so you can skip ahead to the sections that are most interesting to you. And if you haven't already, we'd love you to take a moment to give us a follow over at Spotify, Apple Podcasts, or wherever you like to listen.
AI is transforming industries at a rapid pace, yet many organizations are realizing that their infrastructure may not be ready to support its advanced workloads. Why is this?
MIKE HICKS: So businesses started with this confidence, where they were focusing on strategies in small, experimental environments, and that's where they built confidence that they could make AI work. Then when they actually started to try and scale it up, that's when they hit problems. Think of it like you're cooking for a dinner party. You have six friends around, you think you're a great cook, a great chef, and then you decide to take that into a commercial setting.
All of a sudden you need industrial kitchens and professional equipment, a completely different process to serve 600 people versus six, and this is what they started to face. So they're facing questions such as: Can my infrastructure handle additional AI workloads? What other investments do I need to make? And these are the sorts of things that are keeping them awake at night. And when you think about what's happening here, the scope, from an infrastructure perspective, is far wider.
So we're not just dealing with one particular element. Here we're dealing with on-premises data centers, public clouds, private clouds, total hybrid environments, so all of a sudden the landscape's completely changed. Then you start to layer data sovereignty on top of that. This adds another layer of complexity: they need to navigate where the AI training data is going to be stored and where it's going to be processed. And if you put the regulations on top of that, the cross-border data transfer restrictions, we're really talking about all these different things coming in at once.
That's talking from a general perspective, but if we then come into verticals like financial services or healthcare, we have additional AI infrastructure constraints due to these data residency type requirements. So we have all of these different things that have to come into place, which is really starting to scare people when they look at these workloads, because this is something they've never dealt with. And then on top of that, the actual workloads themselves are significantly more resource-hungry than traditional applications.
So we're dealing with these different models. We're dealing with a different compute model, a different distribution model, and a different interaction model, and all of this is dynamic. We don't have these fixed predictable patterns. If we start sort of thinking into a true agentic perspective, we're making adaptive changes based on the information, so we need to understand what's happening.
All of this is why the industry is sort of changing at this rapid pace. We're starting to deal with this complete new infrastructure with all these elements thrown in at once.
BARRY COLLINS: What steps can organizations take to assure performance while integrating and deploying AI?
MIKE HICKS: So the first thing is, we've talked about this changing landscape, so we need to better understand where that is from an infrastructure perspective, understanding our entire digital footprint from the users to the application. I need to be able to understand all these different third-party dependencies. We can't just consider these assets to be static; we actually need to be able to do this within a dynamic environment. Then we need to consider where the AI and the non-AI workloads are going. We're not completely going to an AI-only world; we still have these non-AI workloads. So where are they going to be located? Where are our resources going to be located?
Can they coexist in terms of sharing resources? How are they going to be accessed? Where is the user base? Are we going to have any impacts on those?
And then you really need to consider this scaling strategy. Remember, we're dealing with this dynamic environment. We're never going to have a set and leave scenario in here. So we've actually moved away from that, where we're now really starting to say, okay, I need to scale horizontally and vertically at the same time.
I need to stretch out, I need to Stretch Armstrong, and stretch these things out from there. But I need to be able to do that dynamically because I'm not necessarily needing those resources all at once, it's when something's coming in. And more importantly for all of that, and this is where that visibility across my entire infrastructure starts to come into real play, is that I need to make sure that these are always available. So we're coming from a resilience perspective, where it's not just that they're up, but they're actually performing.
Because if I actually drop one part of this service delivery chain now, it doesn't necessarily mean that the system is going to fail; it might mean that I miss valuable data that would allow me to make another decision.
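Mike's point about scaling horizontally and vertically at the same time, and only when demand actually arrives, can be sketched as a toy autoscaling decision. The metrics, thresholds, and action names below are entirely hypothetical, just to show the two scaling axes working together.

```python
# Hypothetical sketch: a scaler that chooses horizontal (more replicas)
# and/or vertical (bigger replicas) actions based on current demand.
def scale_decision(cpu_util, queue_depth, replicas, max_replicas=10):
    """Return (replicas, actions) for a hypothetical AI serving tier."""
    actions = []
    if queue_depth > 100 and replicas < max_replicas:
        replicas += 1                  # horizontal: absorb bursty request volume
        actions.append("scale_out")
    if cpu_util > 0.85:
        actions.append("scale_up")     # vertical: larger instances for heavy models
    if queue_depth < 10 and cpu_util < 0.30 and replicas > 1:
        replicas -= 1                  # release idle resources between bursts
        actions.append("scale_in")
    return replicas, actions

print(scale_decision(cpu_util=0.9, queue_depth=150, replicas=3))
# → (4, ['scale_out', 'scale_up'])
```

The key design point is that scale-in is as important as scale-out: because AI demand arrives in bursts, the system has to release resources it isn't using rather than holding a static peak-sized footprint.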
BARRY COLLINS: In addition to hybrid environments, you mentioned that distributed architecture will also likely be part of many organizations' AI strategies. Why is that and what are the performance challenges to consider?
MIKE HICKS: The why of that is really because of what we've talked about: we're going to have this distributed architecture, and we have all these different issues that come into place. At a real base level, AI will always perform better when it's close to the data sources and the users themselves. So we're going to reduce latency to reduce that response time; it's no good waiting on a response there. What that drives is edge computing, so we're going to have this moving out to the edge. And when we talk about these hybrid environments, we also have to consider where our data sets are going to be.
There's this evolution towards something like a data center in a box, where you need to improve your ability to deploy AI models to edge devices for better performance, but also to meet some of the regulatory requirements you actually have. And then you have this performance challenge and complexity. Now we have this multiple distributed environment: we have edge compute resources, we have some centralized, we have some in various clouds, we maybe have a user base that's distributed, we might have one agent that's actually in a different part of the world. So we have to manage performance across this multi-distributed environment as well.
So we need to be able to put that picture together. We're no longer monitoring individual elements; we now need not just a holistic perspective, but an in-context perspective, so that I can dynamically adjust to where it is. It's no good me just saying okay, yep, I can get to that point, it's available, it's available, if all of a sudden my data set's moved, or I've done a replication, or I've gone to a different agent because of some of the input I've received. So I'm no longer just monitoring to that point.
It's essentially like monitoring a streaming service and saying okay, the frontend's up, I can get to that, but in another part the actual system that feeds that content down has failed, so it doesn't mean anything at all. The other thing to this as well is, we've always said the network's important, the contextual glue that allows all this to run, but that's now true more than ever. We have this real heavy dependence on the network, and the question has actually shifted from "can we have connectivity?" to "can we perform?"
And this isn't just data integrity in terms of sort of the quality of the data that's coming across there. We could have delayed responses, so it's performance as well that starts to come in. What I'm saying there is it's actually a delay or a loss or some sort of quality issue when we're retrieving some part of the data. Then we have this issue where we've either missed a subset of data, or we're lacking information or we've gone into a halt state on a particular process as well because we physically couldn't receive it.
So effectively, if we're talking about connectivity, that is now table stakes, and it's the performance and the assurance of the delivery of the service on top of that, in this dynamic environment, which becomes critical.
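That "connectivity is table stakes" idea lends itself to a simple illustration: grade each dependency on latency and loss, not just reachability. This is a hypothetical sketch with made-up function names and thresholds, not any particular monitoring product's logic.

```python
# Sketch (hypothetical thresholds): reachability alone isn't enough.
# A dependency can be "up" while still failing the performance bar.
def grade_dependency(reachable, latency_ms, loss_pct,
                     max_latency_ms=150, max_loss_pct=1.0):
    if not reachable:
        return "down"       # the classic binary check stops here
    if latency_ms > max_latency_ms or loss_pct > max_loss_pct:
        return "degraded"   # connected, but not performing
    return "healthy"

# A reachable endpoint can still be the thing slowing the whole chain down.
print(grade_dependency(True, 240, 0.2))  # → degraded
print(grade_dependency(True, 40, 0.0))   # → healthy
```

The three-state outcome is the point: a binary up/down view would report both of those endpoints as identical, while the delayed or lossy one is exactly the case Mike describes where an AI process misses or stalls on the data it needs.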
BARRY COLLINS: Tell us more about how AI changes how we think about digital resilience.
MIKE HICKS: AI introduces unfamiliar scenarios, where these systems have to interact together in ways that traditional resilience planning didn't necessarily account for. We have these unpredictable performance patterns, where the model's decisions create unexpected system behaviors. We may see something that consumes resources in one particular area but not in another, and that may shift from day to day, from hour to hour. So we have this constantly shifting dynamic pattern that we've never seen before, because of all these moving parts.
On top of that then, you have sort of these third-party dependencies. So now we're even more reliant on these external networks and these services because it could be a crucial function within that system. So yes, I might have my main source of my data within one particular area, but if I need to go and pull a resource that's from somewhere else that has this critical information as well, I need to get that in a timely fashion. I need to be able to pull that data down.
These are outside of your control. On top of this, when we're thinking about this resilience, like I said, if we're actually thinking beyond the connectivity, we've got to be able to understand and effectively mitigate these cascading failures that can occur.
So if you have a system failure in one part, as we've seen before when we're dealing with one particular application or one service, it may stop the service working, or may stop a particular function within that service working. Obviously, it has an impact.
We have these dependencies, including disparate dependencies. What I mean by disparate dependencies: these are potentially agents that you only call once in a process, or that you might not call at a regular interval. So they aren't part of your daily service delivery chain, but they become dynamically part of it. You now have this importance: I need to know that that is up, because when we use it, I want to know it's there. At some point it's actually part of our service delivery chain, even though it's separated from my overall service on what I consider to be a regular basis, so I still need to know it's going to be there when I call it.
When I make that request for information, I need to actually get it back. And because of these interactions, we potentially have this broader impact scope, so a failure can now affect multiple business processes simultaneously. Traditionally, when we're accessing a service, it might be an ERP or a CRM type of service. That might be used by one department or one area, or we have the broader ones, like Office 365, for example.
But now, because of the way this system is set up, they're effectively all intertwined. So I could have a single failure that could impact several business processes at once, whereas in the past, like I said, we might have just hit this one group of users. The impact footprint has the potential to be much wider.
BARRY COLLINS: It sounds like spotting performance issues can be much more difficult in this new AI era.
MIKE HICKS: So because of all these different interactions that we're going to have with all these systems, it actually makes detection of issues far harder now, or increases that complexity. We're not just looking for hard downs or outages, or even these functional failures; we might have one particular system that's running slowly, and therefore we're not reaching that agent or that set of data we actually need to complete the task.
We now need to measure across this performance perspective. So we're not assuming just connectivity or looking for a large outage; we're looking for these slight degradations. This is where you now need to start to talk about the quality of the data in terms of the performance of delivery, and then really think about what we're actually getting out from that system as well. This ultimately shifts everything away from this break-fix mentality.
I can't afford to be in a break-fix world, because once it's actually gone down, because of that large potential footprint I have, because of the slowdown, the degradation in performance, the quality-of-data issues that might occur, as well as the cascading effects there, if I wait for that to happen, the impact could be far greater. So what I've really got to do is move truly into this predictive state, so I start to understand: well, where is it going to go? We talked about the dynamic nature of this, so what do I actually do in this environment?
So this is then why I'm talking about why you need to have this visibility into these disparate dependencies as well at the same time.
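One way to picture visibility into those rarely-called, disparate dependencies is a synthetic check that probes them on a schedule even when no workflow currently needs them. This is an illustrative sketch; the class, probe, and interval are all hypothetical.

```python
# Hedged sketch: dependencies invoked only occasionally still need
# regular synthetic checks, so you know they're healthy *before* the
# business process actually calls them.
import time

class SyntheticMonitor:
    def __init__(self, probe, interval_s=300):
        self.probe = probe           # callable returning True if the dependency is healthy
        self.interval_s = interval_s # how often to re-probe (e.g. every 5 minutes)
        self.last_checked = 0.0
        self.healthy = None

    def check(self, now=None):
        """Probe at most once per interval; otherwise return the cached result."""
        now = time.time() if now is None else now
        if now - self.last_checked >= self.interval_s:
            self.healthy = self.probe()
            self.last_checked = now
        return self.healthy

# Probe a rarely-called agent on a timer, independent of real traffic.
mon = SyntheticMonitor(probe=lambda: True)
print(mon.check(now=1000.0))  # → True
```

The cached result means the check is cheap to consult at call time, while the timer guarantees the dependency's status is never older than one interval when a process suddenly pulls it into the service delivery chain.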
BARRY COLLINS: Tell us more about performance problems that can occur if organizations aren't AI ready. What type of outages or disruptions might companies experience?
MIKE HICKS: The one that really comes to mind is that cascading performance degradation. So we have this slowdown coming through. This could be because we have insufficient resources, or because we've just done a lift and shift, and we're starting to get these delivery failures themselves. And again, they start to spread across the whole environment.
So then, you know, what can actually cause these? Because we have all these different parts around there, we have potentially sort of resource starvation. So if we think about an AI workload, it can monopolize the CPU, the memory, and the storage, sort of starve the traditional applications. So we might be running along very well with our AI system, but all of a sudden this coexisting traditional application as it were, is starting to have this degraded performance.
And this might even be an element that you're calling that, again, you have no control over. So you have this element that you have no visibility into, or way of controlling, that is starting to run slowly. What this essentially leads to is instability in the infrastructure itself. If I'm unprepared, if I haven't been able to scale this properly, then I could have crashes.
The trouble with a crash is that it's an ungraceful shutdown. And remember, we're not talking about monolithic architecture anymore; we're talking about distributed architecture, even for traditional applications. So what happens then is we might lose a part, and the recovery process takes longer, because we might have to go and rebuild the database to get the system back online, or restart the system itself. So it becomes really important, from this scaling perspective, to understand your compute resources alongside how your system's performing, to have that direct correlation between the two.
So when we think about it from an end-user perspective, essentially what they're going to face from all of this is a slow application, or failed transactions, or system unavailability. Obviously this impacts overall production, and it might not be specifically on the AI application or the tasks we're looking at; it could be an adjacent application or a business application that they actually need to use. This has moved on from when we were talking about resilience in terms of keeping the lights on; this is keeping the lights on and things performing, so productivity itself.
BARRY COLLINS: Finally, Mike, do you have any tips to leave us with?
MIKE HICKS: Top of mind is: AI readiness is a journey, not a destination. Continuous adaptation is essential; it's not a one-time infrastructure upgrade. You're going to have to constantly look at this and change it. Then I'll say embrace hybrid thinking: plan for AI and traditional workloads to coexist, rather than replacing everything at once.
Look for that scaling environment. Scale as you go, and test before you scale: implement testing processes, and understand where the AI workloads are before moving from pilot to production. Don't just think you can do this as a straight changeover; this is not like what we've done before. But do learn from your cloud transformations, and apply the lessons learned from cloud migration.
We saw that a lift and shift doesn't actually work; think about the functions, think about the architecture, and adopt those lessons as you move to an AI strategy. Then I'll say future-proof your flexibility. Build infrastructure that can adapt. We talked about that East/West and North/South elasticity, so adapt to an unknown future.
We don't know where it's going to be, but we can't lock ourselves into what's happening now; we have to be able to adapt to AI innovations and requirements as they come up.
By the Numbers
Let’s close by taking our usual look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (June 16 - July 6).
Global Outages
- Following the elevated levels observed in mid-June, global outages began declining over the subsequent weeks. From June 16 to 22, ThousandEyes recorded 252 global outages, a 33% decrease from the 376 outages observed the previous week. The downward trend continued during the week of June 23-29, when outages dropped further to 208, a 17% decrease from the prior period.
- The declining trajectory continued into the next week (June 30 - July 6), with global outages falling to 150, representing a significant 28% decrease from the previous week. This three-week consecutive decline brought global outage levels to their lowest point since the tracking period began, marking a 60% reduction from the peak of 376 on June 9-15.
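The week-over-week percentages above follow from a simple percent-change calculation. As a quick sanity check, using the outage counts reported in this section:

```python
# Reproduce the week-over-week changes from the reported outage counts.
def pct_change(prev, curr):
    return round((curr - prev) / prev * 100)

weekly_global = [376, 252, 208, 150]  # Jun 9-15 through Jun 30-Jul 6
changes = [pct_change(a, b) for a, b in zip(weekly_global, weekly_global[1:])]
print(changes)               # → [-33, -17, -28]
print(pct_change(376, 150))  # → -60  (peak-to-trough reduction)
```

Each percentage is relative to the immediately preceding week, while the 60% figure compares the final week directly against the June 9-15 peak.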
U.S. Outages
- While this steady three-week decline wasn’t reflected in the United States, by the end of the period, U.S. outages had also reached their lowest levels in a while. During the first week (June 16 to 22), U.S. outages decreased to 107, representing a 21% drop from the previous week's 135. However, this downward trend reversed during the week of June 23-29, with U.S. outages surging to 128, representing a notable 20% increase.
- During the third week in the period (June 30 to July 6), U.S. outages resumed their decline, dropping to 78, a 39% decrease from the previous week. This brought U.S. outages to their lowest level since early June, when 77 outages were recorded during the week of June 2-8.
- Over the period from June 16 - July 6, the United States accounted for 51% of all observed network outages. This elevated representation suggests that, while global outages were declining overall, U.S. infrastructure challenges persisted as a significant factor, accounting for a disproportionately high share of worldwide network disruptions throughout this period.
Month-over-month Trends
- Global network outages decreased from May to June 2025, dropping 34% from 1,843 incidents to 1,219. This decline is expected as summer begins in the Northern Hemisphere, a time when network maintenance activities typically lessen.
- The United States exhibited a similar pattern, with outages decreasing from 516 in May to 478 in June, representing a 7% decline. This downward trend aligns with seasonal expectations, as infrastructure providers often scale back maintenance activities in the summer months in the Northern Hemisphere.