Outage Exposes the Need for Visibility
Limited visibility into ISP availability and performance became problematic during one such incident. During an outage at one of its ISPs, the firm’s network team spent more than 8 hours triaging the situation and trying to locate the problem. “We really had no visibility into anything outside of our walls, so it made it difficult to be able to find out where the problem really lied.” After this incident, the firm realized that it needed visibility into external networks to reduce the mean time to recovery for outages of this kind.
Monitoring Connectivity to Data Centers
To provide continuous service to its clients, employees and physical branches, the firm needs to monitor connectivity to its data centers from an external vantage point, a capability it recognized it lacked. “We simply didn’t have that kind of visibility for outside networks that we don't control.” Using ThousandEyes Cloud Agents, the firm is able to monitor the reachability of its major data centers in large metropolitan areas globally, such as New York, London and Tokyo, to ensure that employees and clients can access its services without disruption.
Gaining Visibility into DDoS Mitigation Effectiveness
Increased visibility into connectivity and routing at its data centers also gives this financial institution better insight into the performance of its DDoS mitigation vendor. “Previously, we didn't have any visibility into when our DDoS mitigation service was engaged, or how successful it was when active.” Using ThousandEyes, the firm can see when traffic is routed to their ASN and when a switchover takes place. This helps it verify that mitigation is working as expected and that the vendor is meeting its SLAs.
Monitoring the Internet for Outage Threats
While monitoring the reachability of its data centers provides the firm with alerts should service provider availability or performance become an issue, there is a lot happening across the global Internet outside of its four walls that influences what goes on within. With ThousandEyes Internet Insights, the team is able to see outages across the Internet and correlate them with their own incidents to understand the scope of an event.
Understanding Outage Scale and Impact
One example of when this came into play was during a recent Cloudflare disruption, which was caused by a service provider routing error. The team happened to be monitoring a fixed income circuit with ThousandEyes when it received an alert reporting packet loss (Figure 2). As they looked at the macro view of that test with Internet Insights, they were able to see that it was part of a much broader issue. “Our team saw that it wasn't just this circuit that was having issues, it was much larger. The route leak affecting Cloudflare was creating issues across the country. It was affecting Verizon, Level 3 and Google.”
This insight helped them to understand the scope and scale of an issue that impacted their clients. “When you start to hear reports of an outage, you're pretty much in the dark. But if you can tie it to another issue that's happening and its cascading impact across different service providers, it gives you an indication of what's wrong. Then you can communicate that out to your stakeholders and even your customers and employees.”
In another situation, a client of the firm was experiencing connectivity issues, yet nothing on the firm's systems indicated a problem. Using Internet Insights, the team went back and matched the times the client reported against recorded outages, and found that they coincided with a Verizon outage in the New York metro area. The team took that data, escalated it to Verizon and was able to address the client's issue. According to the team, “We'd never been able to have that kind of visibility or be able to corroborate what the client was saying to an ISP before—and from the perspective of the client, they see us as being proactive in helping them resolve an issue.”
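The matching step described above, lining up client-reported times against known outage windows, can be sketched in a few lines of Python. This is only an illustration of the idea, not how Internet Insights works internally: the `Outage` record, the `matching_outages` function, the five-minute tolerance and the sample timestamps are all hypothetical, and real outage data would need to be exported or entered by hand.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Outage(NamedTuple):
    """Hypothetical record of one provider outage window."""
    provider: str
    start: datetime
    end: datetime


def matching_outages(reported, outages, tolerance=timedelta(minutes=5)):
    """Map each reported time to the outages whose window contains it.

    The window is padded by `tolerance` on both sides, because
    client-reported times are rarely exact.
    """
    return {
        t: [o for o in outages
            if o.start - tolerance <= t <= o.end + tolerance]
        for t in reported
    }


# Made-up sample data, for illustration only.
outages = [
    Outage("Provider A", datetime(2020, 1, 6, 9, 10), datetime(2020, 1, 6, 9, 40)),
    Outage("Provider B", datetime(2020, 1, 6, 14, 0), datetime(2020, 1, 6, 14, 20)),
]
reported = [datetime(2020, 1, 6, 9, 42), datetime(2020, 1, 6, 11, 0)]

for when, hits in matching_outages(reported, outages).items():
    names = ", ".join(o.provider for o in hits) or "no matching outage"
    print(when.isoformat(), "->", names)
```

A reported time that falls inside (or within the tolerance of) an outage window gives the team something concrete to bring to the provider. A time with no match suggests the cause lies elsewhere.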
The team also uses Internet Insights to understand the state of the Internet before the stock markets open each day. “There's a list of things we check, and one of them is the Internet Insights, to make sure everything looks clear.” In addition, the firm plans to use the visual dashboards available in Internet Insights for its next-generation NOC, where the subject matter experts (SMEs) of all the different functional groups, from security to web operations, sit together to triage situations that may have an impact across silos.
Delivering services over the Internet relies on an increasingly complex ecosystem of external, third-party providers. While there are many aspects of this ecosystem that businesses cannot fully control, having an awareness of outage events, big and small, across the global Internet can help you understand the environment you're in and respond with precision when an outage does affect you. “If you're providing services today and you rely on external parties to enable you to provide those services, you need Internet Insights.” Leveraging the macro and global views of Internet Insights, combined with the views of their own service-delivery paths, this financial institution has visibility into outage events that they never had before.
Figure 4: Impacted networks due to loss in Verizon (AS 701) stemming from the route leak
Start Monitoring Your Network