During ONUG NY 2019, David Mann, Senior Director of Global Network Services, shared how McGraw-Hill is transforming the quality and consistency of the digital experience it provides to students and educational institutions.
Recognizing that Monitoring Digital Ecosystems is Essential to Reliable Cloud Delivery
Achieving excellent online delivery hasn’t been simply an issue of cloud deployment, explains Mann. Just as important to McGraw-Hill’s digital experience success has been understanding the complexity of the Internet, an ever-changing environment with multiple dependencies that can make or break digital delivery. “We’re a publishing company and we’re also a software company,” he states. “Many of our customers will access our content, not just as a physical book but as software apps hosted up in the cloud.” He emphasizes that McGraw-Hill is highly dependent on services that sit outside its in-house environment. “You’re in a situation where a lot of different components are involved in making sure your digital experience is great—there’s DNS, CDN, you’ve got different cloud providers, your service provider—and think of all the service providers in between. Not all of those you’re responsible for...and many of those companies don’t know they’re providing that service for you—and you’re not paying them either.”
Historically, this kind of digital ecosystem has represented a "black box" to IT and network teams' monitoring efforts.
Enter ThousandEyes: Demystifying how Internet Topology Impacts Cloud Performance
McGraw-Hill’s decision to deploy ThousandEyes for its cloud and Internet monitoring means that Mann and his team can rapidly diagnose the cause of a digital issue. Before ThousandEyes, when a classroom experienced issues with the service, it could take a long time for the customer to show the team what the problem was.
Apart from helping to solve the initial problem, identifying the underlying cause of a specific incident can also have broader long-term benefits for a customer. Mann recalls a recent Internet outage example: “We were able to deploy a ThousandEyes Endpoint Agent to the customer, and the test revealed that the customer had an internal DNS misconfiguration that was causing a periodic problem. We could take that data and provide it to the customer, and it led to a totally different conversation. We told them that ‘the issue is with your internal DNS server, not the public one.’ The customer called us back later and said that it turned out that they were having problems with their entire Internet performance, before we helped them to diagnose and pinpoint the scenario. Now we have loyal customers and they see that we’re a partner in this, and not just a vendor for them.”
NOC Crowdsourcing: Breaking Down Internal Silos
The ThousandEyes platform’s ability to share digestible snapshots that visualize the causes of an outage has encouraged a more open and collaborative approach to problem-solving within McGraw-Hill. “There’s been an ecosystem developing within McGraw-Hill where all these various departments now want to look at the network,” explains Mann. “We create dashboards for our departments. We’ve opened up the TE dashboard and our tools to the rest of our internal organization. We have nothing to hide.”
Democratizing visibility into the network and application has accelerated problem-solving across McGraw-Hill. With customized ThousandEyes dashboards, non-network teams can monitor their application performance and self-diagnose issues directly: “It’s like crowdsourcing your NOC, letting other teams use the tools and have the ability to look at the problem and solve the problem. We gave them some training, but they’re able to do it themselves and look at the results.”
Optimizing Digital Ecosystems: How Outage Insights Improve Overall IT Performance and Resiliency
Having context on online outages has allowed the McGraw-Hill team to take a more proactive and high-impact approach to managing its ecosystem of digital partners on multiple levels.
ThousandEyes’ ability to record the history of an outage allows for forensic post-incident analysis, which can be invaluable in learning lessons on how best to deploy an online service. Mann affirms, “You always want to know your DDoS provider is operating the right way. We can see where the attack is and where the choke points are. We can also see the recovery. These are things that we can save and play back. We can learn from them. We can learn: What was our recovery time and how can we do better?”
The greater visibility provided by ThousandEyes also enhances the quality of conversation and partnership that McGraw-Hill experiences with external providers. Mann sees this as an iterative process: “Time and again, we will see problems in a service provider’s network, or a cloud service provider’s environment. We then go to them and say ‘we think there’s an issue with your SP,’ and we show the provider the public ThousandEyes share link. It helps them to pinpoint where the problem exists, and it totally changes the conversation.”
Having greater insight into Internet-driven outages has also led to constructive feedback from McGraw-Hill that cloud providers can use to inform their product roadmaps and architectures, in turn making their cloud offerings more robust. Mann explains, “We talk to some of the biggest service providers out there, and we’ve been able to help them find some problems. On more than one occasion, some of their cloud offerings have been redesigned because of some of the issues we’ve been able to pinpoint via ThousandEyes.”
Automating ThousandEyes Testing via a New Cognitive AI Agent
A very recent addition to the IT team’s monitoring toolkit is Amelia, a new bespoke AI agent (developed by IPSoft) that has the ability to investigate a user’s performance issues, automate ThousandEyes testing and then escalate the findings to the appropriate team.
Mann reminisces about the genesis of the idea: “I was interested to see if we could formalize something where we can have a conversation with our infrastructure in a chat function. Instead of having top engineers utilizing ThousandEyes to figure out what’s going on, we can do so by questions, maybe even having a Personal Assistant for a service desk member, or allowing a user to talk directly to the platform to help troubleshoot. We partnered with IPSoft … and we’ve created the ability to have an AI conversation with the ThousandEyes platform.”
When a user is experiencing an outage and says that they are having a network issue, Amelia begins to ask questions about the network issue. Typical exploratory questions Amelia puts to a user might be: Are you in the office? Are you working remotely? Are you using a VPN or the public internet? In the background, decisions are being made by the cognitive AI on where to run the test from. Depending on the information provided by the user, Amelia can choose to run a test in the data center, the local office the user is located at, or an endpoint agent installed on the user’s computer. Once this process is complete, Amelia utilizes the ThousandEyes API to run point-in-time tests. If the user says the outage is infrequent, then Amelia will set up a scheduled test and continue to look at the tests over time in order to determine if it is truly a network problem, or something else.
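The decision flow described above can be sketched roughly as follows. This is only an illustration, not Amelia's actual logic: every name here (`UserAnswers`, `ProbePlan`, `plan_test`, and the enum values) is hypothetical, the mapping from answers to a test location is an assumption, and no real ThousandEyes API call is made.

```python
from dataclasses import dataclass
from enum import Enum


class Vantage(Enum):
    """Where the test runs from (the three options named in the article)."""
    DATA_CENTER = "data_center"
    OFFICE = "office"
    ENDPOINT = "endpoint"


class RunMode(Enum):
    """One-off test versus a recurring test watched over time."""
    POINT_IN_TIME = "point_in_time"
    SCHEDULED = "scheduled"


@dataclass
class UserAnswers:
    """Hypothetical record of the user's replies to the exploratory questions."""
    in_office: bool
    uses_vpn: bool
    has_endpoint_agent: bool
    issue_is_intermittent: bool


@dataclass
class ProbePlan:
    vantage: Vantage
    mode: RunMode


def plan_test(answers: UserAnswers) -> ProbePlan:
    """Pick a test location and mode from the answers (assumed rules)."""
    if answers.in_office:
        # User is on site: test from the local office.
        vantage = Vantage.OFFICE
    elif answers.uses_vpn:
        # Assumption: VPN traffic enters through the data center, so test there.
        vantage = Vantage.DATA_CENTER
    elif answers.has_endpoint_agent:
        # Remote on the public internet with an agent on the user's computer.
        vantage = Vantage.ENDPOINT
    else:
        # Assumed fallback when no closer vantage point is available.
        vantage = Vantage.DATA_CENTER

    # Infrequent problems get a scheduled test; otherwise a single run.
    mode = RunMode.SCHEDULED if answers.issue_is_intermittent else RunMode.POINT_IN_TIME
    return ProbePlan(vantage, mode)
```

For example, a remote user on the public internet with an endpoint agent and an intermittent problem would, under these assumed rules, get a scheduled test run from the endpoint agent.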
Mann explains that the cognitive AI’s ability to understand the parameters of an outage is important to how McGraw-Hill manages its escalation processes: “We don’t want it to send an escalation to the wrong team. If it’s app related, we want to get to the right team for that.” Once Amelia has processed the initial data, it will make a determination and inform the user of the next steps. A typical reply to the user might be: “This is not a network-related issue. Let us take your details and the chat that we just had about what we’ve seen. We’ll give you a call right away.”
Closing Lessons:
- Digital transformation is a marathon that requires continuous monitoring and validation. While Mann emphasizes that he’s always looking at new use cases to experiment with, he’s mindful that progress has to be constant and incremental. “We use ThousandEyes for our continuous validation. But we know it all can’t be done at once,” he states.
- Enforcing digital visibility and openness builds business trust. Mann appreciates that being able to monitor digital experience in new ways has led to a cultural meeting of minds between the IT team and the rest of the business. “We’ve learned the importance of sharing tools and being an open book when it comes to metrics and what the state of the network is,” he explains. “On our Slack channel, we stream a lot of what we know, they know we’re not hiding anything. That builds trust between departments.”