
Discovering Latent DNS Issues: A Quad9 Case Study

By John Todd | 9 min read

Summary


Quad9 has deployed its public DNS resolver at nearly 150 locations worldwide, across 84 nations. With such a widely distributed set of servers and services, having adequate visibility into performance behavior is a critical and difficult task when validating operational metrics and alerting on out-of-normal conditions. Both our core DNS service and the BGP configurations we use need external monitoring and trend analysis, and ThousandEyes is helping us in both areas.

Spot The Failure

Quad9's systems use DNS "anycast," meaning that each ThousandEyes probe (or end user) reaches the server on the 9.9.9.9 address (or one of our other anycast addresses) in the city and point-of-presence (POP) that is closest, network-wise. Our IP addresses "exist" in every location where our services are installed. Our broad distribution requires an equally broad set of monitoring nodes, not just common hosting providers in densely networked nations, which tend not to catch problems with POPs in areas with less network density. Since our anycast nodes are so numerous, we value ThousandEyes' ability to tell us where our traffic is being terminated. Their monitoring perspective from a wide variety of end-user and hosting networks gives us a unique view of our service.

To help with monitoring, Quad9 systems that receive a special DNS query ("dig @9.9.9.9 CH TXT id.server") respond with the name of the node that received the query. When we send this query using a ThousandEyes DNS Server test, the ThousandEyes interface reports the contents of the TXT answer in an easily sortable way. ThousandEyes lets us see how geographic routing is working from the networks on which ThousandEyes Agents sit. Is traffic from Greece going to our systems in Frankfurt, or Amsterdam, or Turkey? Are there problems with specific cities? Are there problems with specific client or transit networks? When network disruptions occur, how do certain paths shift?
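For readers who want to reproduce this check themselves, here is a minimal sketch using the dnspython library (our own tooling and ThousandEyes' tests work differently; this is only an illustration) that sends the same CHAOS-class TXT query and prints the identifier of the node that answered:

    import dns.message
    import dns.query
    import dns.rdataclass
    import dns.rdatatype

    # Build the same query that "dig @9.9.9.9 CH TXT id.server" sends
    query = dns.message.make_query("id.server", dns.rdatatype.TXT, rdclass=dns.rdataclass.CH)
    response = dns.query.udp(query, "9.9.9.9", timeout=2.0)

    # The TXT answer carries the identifier of the anycast node that handled the query
    for rrset in response.answer:
        for txt in rrset:
            print(txt.to_text())

Running this from different networks will return different node names, which is exactly the geographic-routing question the DNS Server test answers for us at scale.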

A Mysterious DNS Performance Issue

In late 2018, using the DNS Server test, we found a problem at our Johannesburg, South Africa node (JNB) that was otherwise invisible to us. According to the ThousandEyes reports, the 95th percentile latency for queries terminating at our JNB location was very high, while our own internal tools showed nothing obviously wrong. The alerts from ThousandEyes prompted us to examine the site more closely. We found that our hardware in Johannesburg was replying to roughly one out of every 17 external DNS requests with a very long return time, sometimes 1800 milliseconds or more. These lagged (or lost) queries were randomly distributed across the general DNS traffic with no particular commonality, were visible only to external users and not to our internal monitoring queries, and none of our secondary telemetry would have highlighted those packet failures. Our servers reply to the majority of queries in under one millisecond, so 1800 milliseconds was clearly a significant problem.
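To illustrate why a tail like that shows up in 95th percentile figures but not in averages, here is a rough, hypothetical sketch (again using dnspython, and a placeholder test name; this is not our production monitoring) that repeats a query, records each round-trip time, and reports the median and 95th percentile:

    import time
    import dns.exception
    import dns.message
    import dns.query
    import dns.rdatatype

    SAMPLES = 200        # number of probe queries; enough for a rough tail estimate
    TIMEOUT_S = 3.0

    latencies_ms = []
    for _ in range(SAMPLES):
        query = dns.message.make_query("example.com", dns.rdatatype.A)
        start = time.monotonic()
        try:
            dns.query.udp(query, "9.9.9.9", timeout=TIMEOUT_S)
            latencies_ms.append((time.monotonic() - start) * 1000)
        except dns.exception.Timeout:
            latencies_ms.append(TIMEOUT_S * 1000)  # count a lost query as worst-case latency

    latencies_ms.sort()
    median = latencies_ms[len(latencies_ms) // 2]
    p95 = latencies_ms[int(0.95 * len(latencies_ms)) - 1]
    print(f"median: {median:.1f} ms   p95: {p95:.1f} ms")

With roughly one slow answer out of every 17 (about 6 percent of queries), the 95th percentile lands squarely on that slow tail, which is why the ThousandEyes figures flagged JNB even while averages looked healthy.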

JNB was one of the first locations we brought up when the service was rolled out in 2017, and it ran on older hardware already slated to be swapped out. While further debugging might have narrowed the problem down to a CPU, memory, network card, or other hardware issue, we determined that an easier and more worthwhile solution was to accelerate the deployment of all-new equipment at the site. This also gave us more room to scale in that region, where we expect demand to increase naturally over time.

Immediately after we swapped the old systems out for the new servers, lookup times from ThousandEyes probes in Johannesburg registered roughly 1 ms latency on queries, with no spikes or systemic losses since then. Without ThousandEyes, we would not have noticed this "slow-drip" degradation of service, and end users in that location might have experienced slower or inconsistent results for much longer. This was one of many performance and monitoring wins we've had with the system, and we're moving forward with other ways to test and measure, including integration with our internal tools.

BGP: Don't Touch That Dial

We're also using ThousandEyes to monitor traffic behavior resulting from BGP modifications. Our network has more than 140 locations where we peer with other providers over thousands of peering sessions, mostly at IX locations (via our partner, Packet Clearing House). Tuning announcement community strings and AS-path padding is a perilous task: minor changes intended to improve one peer or transit provider's path will often cause unexpected negative results for other traffic, frequently in regions or on paths that seem entirely unrelated.

The danger of invisible negative results makes many network operators hesitant to change BGP, with good reason. Without near-real-time awareness of the effects of a change, these types of tuning modifications often lead to difficult-to-diagnose end-user complaints. We use the loss and latency figures from the DNS results to track changes over time, and a set of screens in the ThousandEyes interface lets us page through many of the test locations and quickly see, via graphs, whether latency or loss is higher or lower than when we made the last BGP change. Once we've identified a site that has seen a change, we can "zoom in" and see the path modifications over time, before and after the announcement modification.
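As a toy illustration of that before/after comparison (with made-up agent names and latency values; in practice we read this off the ThousandEyes graphs rather than running a script), the underlying logic amounts to flagging any vantage point whose latency moved by more than a threshold after a BGP change:

    # Per-agent median DNS resolution time in milliseconds, before and after a
    # BGP announcement change. Agent names and numbers below are illustrative only.
    THRESHOLD_MS = 20.0

    before = {"Athens": 41.0, "Frankfurt": 9.0, "Istanbul": 55.0}
    after = {"Athens": 18.0, "Frankfurt": 9.5, "Istanbul": 140.0}

    for agent in sorted(before):
        if agent not in after:
            continue
        delta = after[agent] - before[agent]
        if abs(delta) >= THRESHOLD_MS:
            direction = "improved" if delta < 0 else "regressed"
            print(f"{agent}: {before[agent]:.0f} ms -> {after[agent]:.0f} ms "
                  f"({direction} by {abs(delta):.0f} ms)")

The point of the comparison is the one the paragraph above describes: a change that helps one region can quietly hurt another, so every location has to be checked, not just the one we were tuning.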

Since we have the visibility of ThousandEyes' UI and Cloud Agent set combined with our own telemetry, we have reasonable confidence that we can understand the results of the BGP changes we make at each of our locations to improve latency and availability. ThousandEyes lets us know if the changes we made to fix a problem have caused other problems, or if they've improved performance across the board. Often it's a game of whack-a-mole, where one improvement causes a problem that has to be solved another way. BGP is a blunt instrument, though with communities there is some finesse that can be applied to get better results than with no tuning at all. While we'll never have perfect visibility, being able to combine our telemetry with the path data from ThousandEyes gives us confidence that we're improving the service experience for end users.
