Finding the Root Cause of Loss and Latency in Internet Facing Applications

By Pete Anderson
8 min read

During my career I’ve seen all kinds of network performance problems: packet loss, latency, TCP issues, you name it. Plenty of tools can track down these problems when they affect traditional enterprise applications. However, it has always been very difficult to find the root cause of issues like loss and latency in Internet-facing applications. I’ve often felt like I was trying to find a needle in the world’s biggest haystack.

Customers are having trouble accessing my site!

Availability is always a top concern for people who manage Internet-facing websites and services. It can impact your customer base, your users, your partners, your vendors, and the list goes on. There are a lot of tools out there that can tell you that your site is fully or partially down, but very few of them will help you figure out why. This is especially true of a partial or intermittent outage, which is much more difficult to diagnose. Figure 1 shows a screen that no site owner wants to see, with availability hovering around the 50% mark.

Figure 1: Financial services website with availability issues.

We can see right away that the site isn’t fully down, but it’s also clearly not in an optimal state. As shown in Figure 2, we can leverage the end-to-end metrics in the ThousandEyes platform to see that there’s been a substantial increase in packet loss that matches up with the drop in availability we observe:

Figure 2: High levels of packet loss match up with the drop in availability.
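
If you want to check that kind of match numerically rather than by eye, one option is to correlate the two time series. The following is a minimal sketch, not something the ThousandEyes platform exposes in this form: the `pearson` helper and both sample series are hypothetical and are not taken from the figures above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-interval samples (percent); placeholders, not real measurements.
packet_loss = [0, 0, 1, 35, 48, 52, 50, 47]
availability = [100, 100, 100, 63, 50, 50, 50, 50]

r = pearson(packet_loss, availability)
print(f"loss vs. availability correlation: {r:.2f}")
```

A coefficient close to -1 would suggest that availability falls as loss rises. Correlation alone does not prove that loss is the cause, which is why the next step is to look at where the loss occurs.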

However, there’s something else interesting going on in Figure 2: three of our test locations are bright green, indicating that they see no packet loss. Those locations are listed in Figure 3.

Figure 3: No packet loss is observed for Denver, Las Vegas or Toronto.

So what’s going on here? Why are five locations consistently showing loss and having availability issues while the other three are fine? ThousandEyes path visualization technology is uniquely suited to answer this question for us. In Figure 4 we can see that the five locations having issues connect to the site we are monitoring via a different path and provider than the three that are working fine.

Figure 4: The five locations experiencing loss connect via Road Runner (Road Runner nodes are highlighted in yellow) while the other three connect via AT&T. Nodes experiencing loss are indicated by red circles.
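
The same comparison can be sketched outside the UI by grouping per-location loss by upstream provider. This is only an illustration under stated assumptions: Denver, Las Vegas and Toronto are the locations named above, but the `agent-N` names, the `loss_by_provider` helper and every loss value are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean

# (test location, upstream provider, packet loss %).
# Only the three AT&T location names come from the article; the rest is hypothetical.
samples = [
    ("Denver", "AT&T", 0.0),
    ("Las Vegas", "AT&T", 0.0),
    ("Toronto", "AT&T", 0.0),
    ("agent-4", "Road Runner", 40.0),
    ("agent-5", "Road Runner", 55.0),
    ("agent-6", "Road Runner", 48.0),
    ("agent-7", "Road Runner", 52.0),
    ("agent-8", "Road Runner", 45.0),
]


def loss_by_provider(rows):
    """Return the average packet loss per provider."""
    grouped = defaultdict(list)
    for _location, provider, loss in rows:
        grouped[provider].append(loss)
    return {provider: mean(losses) for provider, losses in grouped.items()}


summary = loss_by_provider(samples)
# Print the worst provider first.
for provider, avg in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{provider}: {avg:.1f}% average loss")
```

If every lossy location shares one provider and every clean location uses another, the provider becomes the obvious place to start looking.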

Not only are we able to quickly identify that the availability and packet loss issues are isolated to the Road Runner network, we can even see the specific nodes where loss is occurring. We can also verify that locations connecting via AT&T are doing just fine. With ThousandEyes, you could even generate an interactive share with all this information and send it to your provider with a couple of mouse clicks, as shown in Figure 5.

Figure 5: Interactive collaboration with a couple of mouse clicks.

Some users are complaining that our site is slow!

Slow websites aren’t much more fun than sites that you can’t get to at all; no one likes to watch the loading icon spin around and around. When the response time issue is intermittent or doesn’t impact every user, the problem becomes that much harder to solve. Figure 6 shows an employee/partner portal for a large manufacturer, where response time increased sharply for several hours.

Figure 6: 2x-4x increases in response time observed.

Taking a look at the ThousandEyes end-to-end metrics, we can again see that packet loss is a major symptom of our problem. However, as shown in Figure 7, once again only some locations are impacted.

Figure 7: Overall packet loss matches up with the observed response time increase.

In this case the site owner said that they use only one provider. Can path visualization save the day again? Absolutely. Even though all the locations access this site via AT&T, we can see in Figure 8 that the loss is occurring in only one portion of AT&T’s network, and we can even identify the specific nodes that are dropping packets.

Figure 8: All the locations experiencing loss transit via the node in the tooltip. The other locations are experiencing normal response times and no loss.
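
The reasoning behind that picture can be sketched as a set operation: take the hops that every lossy location passes through, then discard any hop that a healthy location also uses. This is a rough sketch of the idea, not how ThousandEyes computes it. The `suspect_nodes` helper, the agent names and the hop addresses (drawn from the reserved documentation ranges) are all hypothetical.

```python
def suspect_nodes(paths, lossy_agents):
    """Hops shared by every lossy agent's path that no healthy agent traverses."""
    lossy = [set(paths[agent]) for agent in lossy_agents]
    if not lossy:
        return set()
    healthy = [set(hops) for agent, hops in paths.items() if agent not in lossy_agents]
    shared_by_lossy = set.intersection(*lossy)
    seen_by_healthy = set().union(*healthy)
    return shared_by_lossy - seen_by_healthy


# Hypothetical hop lists per test location; the last hop is the monitored site.
paths = {
    "agent-a": ["192.0.2.1", "192.0.2.10", "198.51.100.1"],
    "agent-b": ["192.0.2.2", "192.0.2.10", "198.51.100.1"],
    "agent-c": ["192.0.2.3", "192.0.2.20", "198.51.100.1"],
}
lossy_agents = {"agent-a", "agent-b"}

suspects = suspect_nodes(paths, lossy_agents)
print(sorted(suspects))
```

The monitored site itself drops out of the result because the healthy location reaches it too, leaving only the hop that the affected locations have in common.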

Making the black box transparent

The Internet has often been referred to as a “black box” when it comes to troubleshooting. Very few tools provide any useful visibility into what happens inside the Internet. Historically, troubleshooting has involved a lot of finger pointing, with everyone offering a theory in which the problem is not their fault. ThousandEyes is revolutionizing how we troubleshoot these issues by providing transparency into the Internet and a common view that all parties can use together when trying to find the root cause of problems. I encourage you to sign up for a free trial of ThousandEyes and see how path visualization can decipher the Internet for you.
