There’s a set of tools that every network engineer is familiar with and that have been around for years, in some cases almost as long as the Internet itself. Yes, I’m talking about tools like traceroute, mtr, iperf and ping. These tools have been the bread and butter of network troubleshooting and are very useful for simple tasks such as checking whether a host is reachable, determining the route to a destination, and measuring network latency. They are good enough for these basic tests in most cases, which is part of the reason they are so popular. However, they can produce seriously inaccurate results in more in-depth diagnostics, and not much effort has been spent on understanding the biases and limitations inherent in the data they provide. In this post we will dive into some of these limitations.
The original version of traceroute was written by Van Jacobson in 1988. Traceroute sends "probe" packets with TTL (Time to Live) values incrementing from one, and uses ICMP "Time Exceeded" messages to detect router "hops" on the way to the specified destination. It also records "response" times for each hop, and displays loss and other types of failures in a compact way.
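The TTL mechanism described above can be sketched as a small simulation. This is only an illustration, not how any real traceroute is implemented: no packets are sent, and the path `PATH`, the helper `send_probe` and the reply labels are hypothetical names made up for the example. It assumes UDP-style probing, where the destination answers with something other than Time Exceeded (modelled here as a port-unreachable reply).

```python
from typing import List, Tuple

# Hypothetical path used only for this illustration:
# three routers followed by the destination host.
PATH = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "192.0.2.10"]


def send_probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe and return (responder, reply_kind).

    A probe with TTL n expires at the n-th node on the path, unless it
    reaches the destination first.
    """
    if ttl < 1 or not path:
        raise ValueError("ttl must be >= 1 and path must not be empty")
    index = min(ttl, len(path)) - 1
    if index == len(path) - 1:
        # The destination replies with something other than Time Exceeded,
        # which is how the tracer knows it has arrived.
        return path[index], "port_unreachable"
    return path[index], "time_exceeded"


def traceroute(path: List[str], max_ttl: int = 30) -> List[Tuple[int, str, str]]:
    """Send probes with TTL 1, 2, 3, ... until a non-Time-Exceeded reply."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, kind = send_probe(path, ttl)
        hops.append((ttl, responder, kind))
        if kind != "time_exceeded":
            break  # destination reached
    return hops


if __name__ == "__main__":
    for ttl, responder, kind in traceroute(PATH):
        print(ttl, responder, kind)
```

Each loop iteration stands for one probe; a real tool would also time every reply and treat a missing reply as a timeout.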
- The protocol used by traceroute can make a big difference. The default Linux traceroute uses UDP probe packets, but ICMP and TCP options are also available. UDP packets are often blocked by firewalls, so if you’re trying to reach a web server, UDP might not be a good option. You could use ICMP instead, but ICMP packets can also be blocked at the destination. Even worse, ICMP packets can follow different routes than TCP packets, which is especially true for ISPs that want to make their networks look “faster” than they are. Unless you’re trying to reach a network device, TCP probes will give you the most accurate results. A basic rule of thumb is that the final destination needs to be able to answer with something other than ICMP Time Exceeded so that we know we reached it. This way you can also tell whether the final destination is down.
- Load balancing can distort discovered routes. Because traceroute relies on multiple probes to discover a path to a destination, load balancing in the middle of the path can distort the inferred route. This behavior is explained in detail in the Paris-Traceroute paper, first published in 2006. There are ways to overcome it, but unfortunately the default Linux traceroute doesn’t use them.
- You can’t tell the difference between muted interfaces and real loss. Muted interfaces are those that never reply with ICMP Time Exceeded packets. With a single traceroute run it’s virtually impossible to tell the difference between a muted interface and a loss episode, i.e. a case where either the probe got lost on the way to the interface or the reply from the interface was lost.
- You can end up with a very incomplete view of the end-to-end path. Most pairs of nodes in the network have more than one possible route between them. To explore all the alternative routes you need to send several probes from the source to the destination. This can be a problem if you don’t have a way to solve the load balancing distortion described in the second item above.
- MPLS can distort per-hop delays. This is basically caused by the u-turn behavior of some MPLS tunnels. More details are in my previous blog entry. Don’t be surprised if the per-hop delays traceroute gives you make some links look faster than the speed of light.
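The load balancing item in the list above is easier to see with a toy model. The sketch below is a simulation under stated assumptions, not real probing: the diamond topology, the node names and the "hash" (port modulo number of branches) are all hypothetical. It assumes a per-flow load balancer that picks a branch from the UDP destination port, a classic tracer that changes that port on every probe, and a Paris-style tracer that keeps it constant.

```python
from typing import List

# Hypothetical diamond topology: load balancer "L" splits traffic over two
# branches (A -> C and B -> E) that merge again at the destination "D".
BRANCHES: List[List[str]] = [["A", "C"], ["B", "E"]]
BASE_PORT = 33434  # starting UDP destination port, used here for illustration


def responder(ttl: int, dst_port: int) -> str:
    """Return the node that answers a probe with this TTL and destination port."""
    if ttl == 1:
        return "L"
    if ttl >= 4:
        return "D"
    # Per-flow load balancing, modelled as a trivial hash of the port.
    branch = BRANCHES[dst_port % len(BRANCHES)]
    return branch[ttl - 2]


def classic_trace() -> List[str]:
    """Classic behaviour: the destination port changes with every probe."""
    return [responder(ttl, BASE_PORT + ttl - 1) for ttl in range(1, 5)]


def paris_trace(flow_port: int = BASE_PORT) -> List[str]:
    """Paris-style behaviour: the flow identifier stays the same for all probes."""
    return [responder(ttl, flow_port) for ttl in range(1, 5)]


if __name__ == "__main__":
    print("classic:", " -> ".join(classic_trace()))
    print("paris:  ", " -> ".join(paris_trace()))
```

In this model the classic trace stitches together hops from both branches and reports a link between B and C that does not exist in the topology, while the constant-flow trace stays on a single real branch. Running the constant-flow trace with several different flow ports is one way to enumerate the alternative routes.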
MTR, aka My TraceRoute, is a streamlined version of traceroute that periodically sends probes with different TTLs (usually one probe per TTL per second) to compute the average response time and loss per hop. Besides suffering from the same problems as traceroute described above, MTR uses ICMP Echo Requests by default, which are subject to ICMP rate throttling by interfaces, as well as to ICMP Time Exceeded throttling, both of which show up as “Loss”. Furthermore, MTR does not keep per-path state; everything is per hop, so with multipath routing, multiple interfaces will be aggregated under a single number, even though they belong to very different paths. This makes it harder to use in troubleshooting scenarios.
Host                                    Loss%   Snt   Last    Avg   Best   Wrst  StDev
....
 8. 220.127.116.11                       10.0    94  181.2  180.6  180.4  183.2    0.6
    ae-47-47.ebr2.NewYork2.Level3.net
    ae-45-45.ebr2.NewYork2.Level3.net
    ae-46-46.ebr2.NewYork2.Level3.net
In the MTR example output above, there are 4 interfaces on hop #8 because of multipath routing. However, we can’t tell which interfaces the 10% loss comes from, or which interfaces the latencies refer to. Not only that, the 10% in this case does not reflect real application loss: it comes from ICMP throttling, and there is no real loss here.
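To make the aggregation problem concrete, here is a small sketch of per-hop versus per-interface statistics. The probe log, interface names and round-trip times are invented for the illustration and are not taken from the MTR output above; the functions are hypothetical helpers, not part of MTR.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical probe log for a single hop: (responding interface, RTT in ms).
# A lost probe has no responder, so it is recorded as (None, None).
Probe = Tuple[Optional[str], Optional[float]]
PROBES: List[Probe] = [
    ("if-a", 100.0), ("if-b", 200.0), ("if-c", 300.0),
    ("if-a", 100.0), ("if-b", 200.0), ("if-c", 300.0),
    ("if-a", 100.0), ("if-b", 200.0), ("if-c", 300.0),
    (None, None),
]


def per_hop_summary(probes: List[Probe]) -> Dict[str, float]:
    """Aggregate everything under one hop: one loss figure, one average."""
    rtts = [rtt for _, rtt in probes if rtt is not None]
    lost = sum(1 for iface, _ in probes if iface is None)
    return {"loss_pct": 100.0 * lost / len(probes), "avg_ms": mean(rtts)}


def per_interface_avg(probes: List[Probe]) -> Dict[str, float]:
    """Average RTT per responding interface; lost probes cannot be attributed."""
    by_iface: Dict[str, List[float]] = {}
    for iface, rtt in probes:
        if iface is not None and rtt is not None:
            by_iface.setdefault(iface, []).append(rtt)
    return {iface: mean(rtts) for iface, rtts in by_iface.items()}


if __name__ == "__main__":
    print(per_hop_summary(PROBES))
    print(per_interface_avg(PROBES))
```

With this made-up data the hop-level view reports a single loss percentage and a single average latency, even though the three interfaces have very different latencies and the lost probe cannot be tied to any of them.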
In sum, traceroute and MTR can be used for very basic troubleshooting, but their results can’t be taken at face value. At ThousandEyes, we have given a lot of thought to overcoming the limitations of traditional tools such as traceroute. Our Deep Path Analysis technology addresses most of the issues above. In the second part of this blog article I will talk about the limitations of iperf for bandwidth measurements. Stay tuned.