When something goes wrong with a service you depend on, your first instinct may be to head for the company’s status page. Rarely, however, do these dashboards contain all the information you’ll need to successfully diagnose an outage.
Status pages tend to report individual component issues rather than take a holistic view of the entire service delivery chain, they often lack real-time reporting, and they may omit detailed information on the root cause of an outage. Together, these gaps make it difficult to rely on status pages alone.
While a vendor’s status page might be a good starting point if you don’t have any other intel about an outage, ideally it won’t be the only data point you consider.
In this post, we’re going to examine some of the limitations of status pages and explore techniques for accurately identifying what’s gone wrong so that you can minimize downtime.
The Drawbacks of Detailed Dashboards
When you visit a status update page and see a long list of different components, each with a traffic light indicating its current status, it’s easy to believe you’re looking at a real-time, automated reflection of the service health. That’s often not the case.
First, splitting the status update into separate service components can give a false impression of how serious an outage is. You may look at a dashboard, see just one or two components with an amber warning light, and be reassured that nothing major is going on, which might deter you from taking urgent incident response steps.
However, the dashboard is only reflecting the status of those individual components, not the overall service delivery chain. For example, an online accountancy service might have different status updates for user authentication, invoicing, reporting, reconciliation, and many more components. In almost any outage, the vast majority of those components will be displaying green to indicate they are operational. But if the user authentication component has failed, it doesn’t matter whether the rest of the service is working perfectly: you still won’t be able to get through the front door to access it in the first place.
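To make the accountancy example concrete, here is a minimal sketch of how a per-component view and an end-to-end view can disagree. The component names come from the example above, but the dependency map and the reported statuses are assumptions made purely for illustration; a real service would have its own, usually unpublished, dependency graph.

```python
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    AMBER = 1
    RED = 2


# Hypothetical status-page readings for the accountancy example.
reported = {
    "authentication": Status.RED,
    "invoicing": Status.GREEN,
    "reporting": Status.GREEN,
    "reconciliation": Status.GREEN,
}

# Assumed dependency map: what each component needs in order to be usable.
depends_on = {
    "invoicing": ["authentication"],
    "reporting": ["authentication"],
    "reconciliation": ["authentication", "invoicing"],
}


def effective_status(component, reported, depends_on):
    """Worst status among a component and everything it depends on."""
    worst = Status.GREEN
    stack, seen = [component], set()
    while stack:
        name = stack.pop()
        if name in seen:  # skip repeats, which also guards against cycles
            continue
        seen.add(name)
        # Components missing from the page are assumed healthy in this sketch.
        worst = max(worst, reported.get(name, Status.GREEN))
        stack.extend(depends_on.get(name, []))
    return worst


effective = {name: effective_status(name, reported, depends_on) for name in reported}
green_on_page = sum(1 for s in reported.values() if s is Status.GREEN)
usable = sum(1 for s in effective.values() if s is Status.GREEN)

print(f"green on the status page: {green_on_page} of {len(reported)}")
print(f"usable end to end: {usable} of {len(reported)}")
```

With these assumed inputs the page shows three of four components as green, while none of them is usable end to end, because every component depends on authentication.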
To be fair to the service providers, there’s a good reason why they don’t want their entire dashboard flashing red if a key component, such as authentication, fails. It doesn’t help anyone identify the cause of a problem if the entire dashboard is a sea of red warning alerts, and it could cause unnecessary panic. There’s also a reputational issue to consider: No company wants to make an outage look more dramatic than it is, even if the failure of a key component effectively drags down the entire service.
Status pages break down services not only by component but also by region. If you’re a European customer of a cloud provider and its status page shows problems in Asia Pacific, you might not be concerned. However, an issue in one region may impact the overall performance of the service. For example, the service might read from multiple regions but write to only one, and that single write region might be the one affected.
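The regional case can be sketched the same way. The region names, the deployment layout, and the rule that reads are served only from the customer’s own region are all assumptions for illustration, not a description of any particular provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    read_regions: frozenset
    write_regions: frozenset


def customer_impact(customer_region, degraded_regions, deployment):
    """Report whether reads and writes still work for one customer.

    Assumes reads are served from the customer's own region and writes
    can go to any healthy write region.
    """
    healthy_readers = deployment.read_regions - degraded_regions
    healthy_writers = deployment.write_regions - degraded_regions
    return {
        "reads_ok": customer_region in healthy_readers,
        "writes_ok": bool(healthy_writers),
    }


# Hypothetical layout: three read regions, one write region.
deployment = Deployment(
    read_regions=frozenset({"eu-west", "us-east", "ap-southeast"}),
    write_regions=frozenset({"ap-southeast"}),
)

# A European customer while only the Asia Pacific region is degraded.
result = customer_impact("eu-west", frozenset({"ap-southeast"}), deployment)
print(result)
```

Under these assumptions the European customer can still read data but cannot write any, even though the status page shows a problem only in a region they never knowingly use.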
These are just two ways a status page can lull you into a false sense of security, and there are other reasons why it’s risky to rely on status updates alone to guide your incident management.
No service provider has full responsibility for the entire digital service chain for you or your users. A provider may be entirely accurate in reporting that its own service status is fine, but an issue at a peering point may leave a region out of your reach. Without clear visibility over your entire delivery chain, you’ll never know what’s going on if you only turn to status pages.
Timely Updates?
Another factor to consider when visiting status update pages is the timeliness of the information. You might think that status update pages are fully automated, with a vast array of sensors delivering real-time information on the health of the many different components and regions. That’s often not the case.
There’s frequently a significant lag between what the service provider sees in its internal reporting and the notifications it reports publicly on the service status dashboard. It can be minutes or even hours before an outage is accurately reflected on a company’s status page.
We know this is the case because we often see discrepancies between a company’s live status updates during an outage and the timelines detailed in the post-incident reports they publish after the event. Companies may revise timelines to more accurately reflect the actual time an incident started, showing that problems started earlier than was first reported on the status page—perhaps because the issue couldn’t be fully confirmed at the time.
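One way to quantify this lag for a given incident is to compare three timestamps: the start time in the post-incident report, the first public status update, and the first alert from your own monitoring. The sketch below uses made-up times purely to show the arithmetic; none of the values come from a real incident.

```python
from datetime import datetime, timezone

# Made-up timestamps for illustration only, all normalized to UTC.
revised_start = datetime(2024, 1, 15, 9, 57, tzinfo=timezone.utc)       # from the post-incident report
first_status_post = datetime(2024, 1, 15, 10, 42, tzinfo=timezone.utc)  # first public status update
own_first_alert = datetime(2024, 1, 15, 10, 1, tzinfo=timezone.utc)     # first alert from your own monitoring

# How long the status page trailed the real start of the incident.
status_page_lag = first_status_post - revised_start

# How much earlier your own monitoring flagged the problem.
head_start = first_status_post - own_first_alert

print(f"status page lag: {int(status_page_lag.total_seconds() // 60)} minutes")
print(f"head start from own monitoring: {int(head_start.total_seconds() // 60)} minutes")
```

Keeping every timestamp timezone-aware and in UTC avoids mistakes when the status page, the report, and your own tools each display local time.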
Even when a company is aware of an outage and its likely cause, it may hold back from providing detailed information on the status page for various reasons. Companies are cautious about disclosing specifics about a problem before they have fully diagnosed the issue. This means you will often see statements such as “we are aware of an issue and continuing to investigate the cause” on status pages.
Additionally, full disclosure may have security ramifications. Releasing details publicly during incident communications could highlight a potential vulnerability that could expose the company to attack, exacerbating an outage. There could also be legal ramifications for appearing to admit a flaw in their systems, which companies will naturally be reluctant to do.
All of this means that NetOps teams simply cannot rely on isolated status updates to accurately diagnose outages. So what should they be doing instead?
Linking the Chain
It only takes one component, or even just one function, to fail or degrade in order to bring the entire service delivery chain to a halt. When a disruption happens, it’s important to efficiently determine the source, and a critical step in this process is identifying what isn’t causing the problem. This is best achieved by looking at the entire service delivery chain, from the device to the app. Teams need to understand what’s linking these various components together.
To do this, you can’t rely on status pages alone. You might get lucky and be able to correlate an error across multiple status pages, but the pages will likely use inconsistent timestamps or reporting conventions, which makes correlation difficult.
Instead, you should look to combine signals from all the resources and metrics at your disposal. For example, if your network’s internal analytics show an increase in HTTP errors and a provider’s status page shows a potential issue with its API, those two different amber signals could combine to create a red flag for you, because together they suggest that the API issue has taken down a function that is critical to you.
By combining those signals and looking at the resulting metrics in your visibility platform, the status page starts to fulfill its role in life: helping organizations understand if a problem is affecting everyone or just them. It’s crucial to understand your own fault domain, and it’s often the case that only by analyzing these different signals in context can you get an accurate picture of what’s going on.
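The idea of two amber signals adding up to a red flag, and of using the status page to judge whether a problem is yours alone, can be sketched as a small set of rules. The threshold factor, the severity levels, and the function names below are all hypothetical; a real visibility platform would apply its own, richer logic.

```python
from enum import IntEnum


class Severity(IntEnum):
    GREEN = 0
    AMBER = 1
    RED = 2


def internal_signal(error_rate, baseline, factor=2.0):
    """AMBER when the HTTP error rate exceeds an assumed multiple of baseline."""
    return Severity.AMBER if error_rate > baseline * factor else Severity.GREEN


def combined_severity(internal, provider, critical_dependency):
    """Escalate to RED when both signals are degraded on a critical dependency."""
    both_degraded = internal >= Severity.AMBER and provider >= Severity.AMBER
    if both_degraded and critical_dependency:
        return Severity.RED
    return max(internal, provider)


def likely_fault_domain(internal, provider):
    """Rough guess at whether the problem is local or provider-wide."""
    if internal == Severity.GREEN:
        return "no local impact observed"
    if provider == Severity.GREEN:
        return "possibly just us, or not yet on the status page"
    return "likely provider-wide"


# Hypothetical readings: 8% HTTP errors against a 1% baseline,
# while the provider's status page shows amber for its API.
internal = internal_signal(error_rate=0.08, baseline=0.01)
provider = Severity.AMBER

severity = combined_severity(internal, provider, critical_dependency=True)
scope = likely_fault_domain(internal, provider)
print(severity.name, "-", scope)
```

Neither input is alarming on its own here. It is the combination, plus the knowledge that the API backs a critical function, that justifies escalating.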
Status pages can be a useful tool for understanding the current service health and determining whether something like planned maintenance might be going on. However, they’re far from infallible. The information they provide can be delayed, vague, and limited in scope. And they won’t give you the full picture with all the detailed information you need to quickly and accurately identify the root cause of outages.
By deploying a more comprehensive monitoring approach and supplementing status page data with additional sources, you stand a much greater chance of getting the answers you’re looking for and not jumping to conclusions about the cause of an issue.
And it’s not just about identifying the issues but avoiding them next time, too. With automated solutions in place to deal with problems you’ve accurately identified in the past, you can proactively maximize uptime for you and your customers in the future.