This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re taking a break from our usual programming for a special conversation on cloud monitoring best practices. As always, you can read more below or tune in to the podcast for firsthand commentary.
The Core Areas CloudOps Should Monitor
While it would be great if cloud monitoring best practices could be boiled down to a simple list of key metrics for cloud operations teams to watch, the job isn’t that simple. Metrics without context are of limited value. It’s only when we treat cloud monitoring as one holistic system that we can make a positive difference to performance.
Gone are the days when IT infrastructure could be monitored through a linear, client/network/server lens. Today’s cloud environments are multi-dimensional meshes of interconnected services and infrastructures. CloudOps teams need visibility across not only the cloud resources, but every element involved in delivering services to end users. This includes applications, internal and external networks, transit providers, and the cloud providers themselves.
Here are five key cloud monitoring best practices that CloudOps teams should focus on.
1. Set Service-level Objectives
Rather than relying on individual metrics such as latency or packet loss, CloudOps teams must focus on the real end-user experience. This requires combining signals to actively monitor end-to-end performance, helping ensure that the service delivery chain consistently meets or exceeds user expectations.
That, of course, means monitoring not only your own systems, but those of the cloud providers, ISPs, transit providers, and application providers. Each of those providers may appear to be functioning optimally, showing green on their status pages, yet their combined performance can still produce a suboptimal end-user experience. For instance, if part of an application is hosted on a cloud service on the opposite side of the world from your users, it may introduce unnecessary latency. There’s no fault with the service itself; it’s simply the way the pieces are plugged together for that particular use case.
That’s why it’s important to set some sort of benchmark—a service level objective—for each of your end-user experiences. What do you need to achieve to deliver this service to match your business requirements and your users' expectations? That can only be done if you’re looking at it holistically across the whole environment.
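To make that concrete, here’s a minimal sketch of what it means to evaluate an SLO against combined signals rather than any single metric. The thresholds and metric names below are purely illustrative assumptions, not drawn from any particular monitoring tool:

```python
# Illustrative SLO check: a delivery chain only "passes" when every
# combined signal is inside the budget at the same time.
# All thresholds and field names here are invented for the example.

SLO = {
    "max_latency_ms": 200,      # round-trip budget for the full delivery chain
    "max_loss_pct": 1.0,        # acceptable end-to-end packet loss
    "min_availability": 0.999,  # fraction of successful transactions
}

def meets_slo(sample: dict) -> bool:
    """Return True only if every signal is within the SLO budget."""
    return (
        sample["latency_ms"] <= SLO["max_latency_ms"]
        and sample["loss_pct"] <= SLO["max_loss_pct"]
        and sample["availability"] >= SLO["min_availability"]
    )

# Each metric can look acceptable on its own while the combined
# experience still misses the objective.
sample = {"latency_ms": 180, "loss_pct": 0.5, "availability": 0.995}
print(meets_slo(sample))  # False: availability misses the 0.999 target
```

The point isn’t the specific numbers; it’s that the pass/fail judgment is made against the whole experience, not against any one provider’s dashboard.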
2. Avoid Isolation in Decision Making
A common mistake made by CloudOps teams is to make infrastructure decisions based on isolated metrics or costs.
You might, for example, be tempted to move a workload to a different region to save on cost, but basing a decision on price alone may be problematic. It's important to consider the overall impact on the entire service delivery chain, not just that specific workload. Will this move negatively affect performance, potentially costing the business more in lost productivity over time?
On the other hand, after examining your end-to-end service delivery chain, you may decide that you can live with the few extra milliseconds of latency that’d result from shifting the workload—it won’t have any significant impact on your users. Or you may even be able to eliminate a bottleneck to mitigate that added latency. Such decisions can only be made confidently if you’re constantly looking at the bigger picture, not merely at micro-level cost savings or metrics.
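As a purely illustrative sketch of that bigger-picture math (every figure below is a made-up assumption, not real pricing or productivity data), you can frame the decision as hosting savings minus the cost of the latency your users would absorb:

```python
# Hypothetical cost model: does a cheaper region actually save money once
# you price in the extra time users spend waiting? All inputs are assumptions.

def net_monthly_impact(hosting_savings: float, added_latency_ms: float,
                       requests_per_user_day: int, users: int,
                       cost_per_user_hour: float, workdays: int = 22) -> float:
    """Monthly savings minus the monthly cost of user time lost to added latency."""
    extra_hours = (added_latency_ms / 1000 / 3600) * requests_per_user_day * users * workdays
    return hosting_savings - extra_hours * cost_per_user_hour

# Assumed scenario: the region move saves $4,000/month but adds 120 ms to
# 500 requests per user per day, across 2,000 users valued at $60/hour.
print(net_monthly_impact(4000, 120, 500, 2000, 60))
```

With these assumed numbers, the $4,000 monthly saving is swamped by roughly $44,000 in lost productivity; plug in a latency increase your users genuinely wouldn’t notice, and the same formula tips the other way.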
3. Streamline Service Delivery
Given the inherent complexity of distributed cloud environments, it’s essential to streamline service delivery where possible. CloudOps isn’t a “set and forget” process—continuous monitoring is required to identify performance bottlenecks or single points of failure.
That means having access to the right information. With so much detailed network-level data to take in, CloudOps teams should consider setting up context-rich alerts to help them swiftly identify the root cause of performance issues. They also need to be able to see information that is outside of their realm of responsibility, but pertinent to the problem they’re trying to solve.
Automation can play a role here. If a BGP hijack is taking place, for example, CloudOps teams don’t necessarily need to know the precise forensic details. They might have an automated process set up which automatically advertises their network on a different prefix if it detects the effects of a hijack, allowing them to quickly reroute traffic and minimize impact on end-user experience.
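As a hypothetical sketch of that kind of automation (the AS number, prefix, and helper functions below are invented stand-ins for a team’s actual monitoring feed and router tooling), one common mitigation is to advertise more-specific prefixes, which win BGP path selection over the hijacker’s route:

```python
# Hypothetical automated response to a suspected BGP hijack.
# EXPECTED_ORIGIN_AS and OUR_PREFIX are illustrative; real tooling would
# take these from the team's routing registry and router automation.

EXPECTED_ORIGIN_AS = 64512       # the AS that should originate our prefix
OUR_PREFIX = "203.0.113.0/24"    # documentation prefix used as an example

def split_more_specifics(prefix: str) -> list[str]:
    """Split a /24 into two /25s; more-specific routes are preferred in BGP."""
    net, _mask = prefix.split("/")
    a, b, c, _ = net.split(".")
    return [f"{a}.{b}.{c}.0/25", f"{a}.{b}.{c}.128/25"]

def respond_to_hijack(observed_origin_as: int) -> list[str]:
    """If another AS is seen originating our prefix, return routes to advertise."""
    if observed_origin_as != EXPECTED_ORIGIN_AS:
        return split_more_specifics(OUR_PREFIX)
    return []

print(respond_to_hijack(65001))  # ['203.0.113.0/25', '203.0.113.128/25']
```

A real deployment would add guardrails, since more-specifics also increase routing-table load, but the shape of the logic is the same: detect the symptom, reroute, and leave the forensic details for later.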
4. Consider All Connections
It’s not only their connection to cloud providers that CloudOps teams should be monitoring, but connections between the cloud providers themselves. Teams frequently overlook how different cloud providers connect with one another before making decisions on where to place their workloads. Missing these critical insights can lead to poor planning, increased latency, unexpected costs, and reduced user satisfaction.
If you fully understand your service, what components are involved, and how they are all connected together, you can make an informed decision. Do you go with a hybrid environment? Do you store some data locally? Do you split workloads between cloud providers or regions? These are the different types of decisions you can make when you see the full picture.
5. Prepare for Generative AI
Generative AI services are expanding rapidly, and they create new challenges for CloudOps teams. Generative AI introduces different application architectures and unique latency requirements, and also raises important considerations around data sovereignty and residency.
This requires careful planning before you rush headlong into the AI boom. Teams need to make sure that AI workloads are optimally located, not only for latency but also for regulatory compliance. They need to understand the data flows, how the network is connected together, and how quickly data can move between the workload and the users and systems that depend on it.
AI workflows can be very data-intensive, so it may be wise to bring any significant workload as close to the end users as possible, perhaps even hosting it in a local data center to eliminate many of the latency and transmission risks. However, this localized strategy is not without its challenges, including potential scaling issues and increased management complexity.
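A quick back-of-the-envelope sketch shows why proximity matters so much for these workloads (all numbers below are illustrative assumptions): a one-time bulk transfer is governed by bandwidth, but a chatty inference workload pays the round-trip latency on every call:

```python
# Illustrative arithmetic: bandwidth dominates bulk transfers, while
# round-trip latency dominates chatty request/response AI traffic.
# All figures are assumptions for the sake of the example.

def bulk_transfer_s(data_gb: float, bandwidth_gbps: float) -> float:
    """Time to move a dataset once: governed by link throughput."""
    return (data_gb * 8) / bandwidth_gbps

def chatty_workload_s(requests: int, rtt_ms: float) -> float:
    """Network wait time across many small request/response round trips."""
    return requests * rtt_ms / 1000

# Moving 100 GB over a 10 Gbps link costs the same from anywhere:
print(bulk_transfer_s(100, 10))        # 80.0 seconds

# But 10,000 inference calls pay the round trip every single time:
print(chatty_workload_s(10_000, 250))  # 2500.0 s from a distant region
print(chatty_workload_s(10_000, 5))    # 50.0 s from a local data center
```

Under these assumptions, moving the model’s weights once is cheap relative to serving thousands of round trips from the wrong side of the world, which is exactly the case for hosting close to users.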
Generative AI adds another layer of complexity to an already intricate environment. It demands special attention, not to be treated as just another app.