Operationalizing ThousandEyes Using the BSR Methodology
This is the first in a three-part blog series on how to use the BSR methodology to 'operationalize' your investment in ThousandEyes, with the goal of continuous improvement in IT Operations delivery and quality. Integrating with Incident and Event Management alone is not enough to prove the value of the network intelligence you are acquiring. Whether you use ThousandEyes as the lead monitoring intelligence platform or integrate it with other solutions, understanding how to derive and measure business value is critical to continued success as an agile IT Operations unit.
My name is Tony Davis, and I work as a business advisor for companies that are looking to get more out of their mountains of harvested IT Operations data. I use the term mountains because, as a long-time IT Operations leader at several large public companies, I have watched more and more monitoring and logging tools produce more and more mounds of data. Yet I continue to ask one critical question: How much actual operational intelligence is coming from all those tools, and how is it being used to improve the customer experience?
Perhaps the best way to explain the need for such a question is to highlight a recent engagement I had with an IT Operations executive at a large retailer. This executive (an SVP) had been plagued by an escalating number of customer complaints about doing business on the company's website. The complaints were not about outright outages, but about sporadic slow response and a general 'difficulty in doing business.' These comments were prevalent in the notes made by customer service reps handling calls at the retailer's contact centers around the globe. The executive explained that he had approved purchases of monitoring software so his operational teams could isolate and eliminate the problems, but that ten months after implementation there had been no noticeable decrease in the number of complaints or change in the type of issues being noted. The technologies procured were a combination of log monitoring, application performance monitoring, and synthetic test monitoring, with a total financial commitment (upfront and maintenance) of just over $3M per year.
As part of the process I use to assess IT Operations scenarios, I asked for and was permitted to spend a few days observing and chatting with some of the principal users of the new monitoring tools the executive had approved for his team. The goal was not just to watch what the tools were alerting on in production, but also to discern what process was being followed to transform the tools' output data into actionable intelligence, and then how that intelligence was used in a closed-loop continuous improvement model. Not surprisingly, the key findings from those few days of observation were as follows:
As you might imagine, two of the three monitoring technologies were deployed in a completely agent-based, on-premises fashion. Not that there is anything wrong with that model in general in large corporate environments, but as is common, agents had been installed on every possible server in every possible area with default collection settings. Now, there is a lot wrong with that! Following this model meant that the operations team with primary responsibility for production was constantly flooded with data that often had nothing to do with the problem the executive was trying to solve.

Not only was the data flooding the operators, it was also very disparate in nature and had been formatted by engineers who were very familiar with reading raw data on stalled Java threads and processes, packet loss, specialized search languages, and so on. When I talked with the operations specialists responsible for production, it became evident that literally none of them had any experience with coding or debugging Java, network flow analysis, or formatting SQL or other specialized queries.

The assessment of how the data was being used was straightforward: it was being held as forensic evidence in case something happened in production. Only then was it referenced, as a reaction to the executive inquiring why there were so many complaints from customers. So there was no increase in proactive work! The vast amount of data being harvested from these monitoring tools was staying just that: data. It never transformed into actionable intelligence because it was endlessly diverse in nature and was not being distilled into something of obvious value. Moreover, increasing the amount of monitoring does not mean you get a finer granularity of measurement; in fact, it can mean just the opposite.
Do you think the situation I described above is a rarity in IT Operations? I can say with absolute confidence that over the last ten years as a business advisor I have seen such scenarios not only persist but increase in frequency. How does a company like the one I reference take these mountains of data and transform them into measurements that relate to proactively delivering a better customer experience? In part 2 of this blog series, I will continue with this case study and detail how the BSR (Business Service Reliability) methodology helped this retailer make sense of their operational data in the customer's vernacular. We will also discuss how ThousandEyes is a critical component in bringing the essential cloud intelligence variable into the BSR methodology.