On Tuesday, July 25th, 2017 at 4:25am Pacific, all services related to Marketo’s domain “marketo.com” began experiencing an outage. The outage was still having an impact at the time of writing. Marketo is a marketing automation platform widely used by companies around the world, so when it went down on Tuesday, forms on many companies’ websites broke completely. Many users were unable to access the Marketo platform or even email Marketo employees, since Marketo’s mail servers were also inaccessible.
Around 10pm that night, Marketo’s CEO, Steve Lucas, sent out an email that said, “We renew thousands of domain name properties we own every year with precision, yet the auto renew process for registering our main domain, Marketo.com, failed.” Our analysis confirmed that Marketo indeed failed to renew its domain name. When the domain name expired, traffic to Marketo was redirected to an IP address in another provider called Confluence Networks, effectively blackholing that traffic. Read on to understand the details of how this event unfolded.
Severe and Widespread Impacts to the Marketo Service
Beginning at 4:25am Pacific, our HTTP Server test to https://app.marketo.com began observing major impacts to the availability of the service. Availability dipped down to levels around 60-70%, while packet loss also spiked to around 20%. To see the interactive data, feel free to explore this share link.
By looking at the Path Visualization, we can understand where that packet loss is occurring. Traces are terminating in a variety of ISPs (NTT America, Cogent and Tinet) on their way to the same IP address, 208.91.197.132. This is strange—while the other test traces are heading to destinations in Akamai, Marketo’s CDN provider, these terminating traces are attempting to reach this IP address that belongs to Confluence Networks, based on WHOIS records.
208.91.197.132 is clearly not a legitimate IP address for Marketo, which suggests that something is going wrong at the DNS level. So what happens when we set up a few DNS tests to Marketo’s nameservers?
Strangely Similar Symptoms on the Nameserver Level
We set up tests to Marketo’s nameservers that query for the CNAME record of app.marketo.com, and then found that the impacts here looked very similar to what we saw in the HTTP Server test. Availability dropped, packet loss spiked and specific test traces terminated on the path. Explore the data here.
Marketo hosts both of their authoritative nameservers on their own domain, so traffic normally travels to Marketo-owned IP addresses. However, traces from a number of Cloud Agents again attempted to reach the same IP that we saw before: 208.91.197.132. This is true for both authoritative nameservers, ns1.marketo.com and ns2.mktdns.com.
We’ve seen that DNS lookups of app.marketo.com, ns1.marketo.com and ns2.mktdns.com are all resolving to the same bogus IP address (208.91.197.132) for a subset of our monitoring points. Additionally, traces toward this IP terminate inside several ISPs before reaching it, so the endpoint is effectively unreachable. The DNS has clearly been poisoned in some way, but how?
Clues in the WHOIS
At the beginning of the Marketo outage, we performed a WHOIS lookup for marketo.com and got some interesting results. First, note that the creation and expiration dates both fall on July 23rd. The outage began on July 25th, suspiciously close to that date. Because domain name renewals generally occur on annual or multi-year cycles, it’s likely that the Marketo domain expired on July 23, 2017.
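The date comparison above is simple arithmetic. As a minimal sketch, assuming the likely expiration date (July 23, 2017) and the outage date (July 25, 2017) discussed in this post, the check might look like the following. The function names and the 30-day warning window are illustrative choices, not part of any WHOIS tool.

```python
from datetime import date


def days_past_expiry(expiry: date, today: date) -> int:
    """Days elapsed since `expiry`; negative while the domain is still valid."""
    return (today - expiry).days


def needs_attention(expiry: date, today: date, warn_days: int = 30) -> bool:
    """True if the domain has expired or will expire within `warn_days` days."""
    return days_past_expiry(expiry, today) >= -warn_days


# Dates from the post; the 2017 expiry year is the post's own inference.
likely_expiry = date(2017, 7, 23)
outage_start = date(2017, 7, 25)

print(days_past_expiry(likely_expiry, outage_start))  # 2
```

A check like `needs_attention` could run on a schedule against the expiration date returned by WHOIS, so that an approaching expiry is flagged well before the registrar swaps the nameservers.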
As further evidence of expiration, Marketo’s nameservers are listed as ns1.pendingrenewaldeletion.com and ns2.pendingrenewaldeletion.com. These are the nameservers that the registrar, Network Solutions, uses for domains that have expired. After querying these two nameservers for random domain names, we found that they always return the same IP address in Confluence Networks (208.91.197.132) that we saw previously, regardless of the domain name.
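The "same answer for every name" behavior can be checked mechanically. The sketch below assumes the lookups have already been performed (for example with dig or a DNS library) and collected into a dictionary; it only shows the comparison step. The function name and the sample hostnames are hypothetical.

```python
from typing import Mapping, Optional, Set


def wildcard_answer(answers: Mapping[str, Set[str]]) -> Optional[str]:
    """Return the one IP shared by every queried name, or None.

    `answers` maps each queried hostname to the set of A-record IPs
    returned for it. A nameserver that returns exactly one identical IP
    for unrelated names is behaving like a catch-all.
    """
    if not answers or any(not ips for ips in answers.values()):
        return None  # nothing queried, or at least one name got no answer
    distinct = set().union(*answers.values())
    return next(iter(distinct)) if len(distinct) == 1 else None


# Hypothetical results of querying unrelated, random names.
observed = {
    "random-name-one.example": {"208.91.197.132"},
    "random-name-two.example": {"208.91.197.132"},
    "random-name-three.example": {"208.91.197.132"},
}
print(wildcard_answer(observed))  # 208.91.197.132
```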
According to war stories from a similar event in 2013, Network Solutions transfers expired domains to its partner, Confluence Networks, in order to monetize the traffic sent to those expired domain names. This is exactly what we saw: when Marketo’s domain expired, Network Solutions changed the domain’s nameservers to ns1.pendingrenewaldeletion.com and ns2.pendingrenewaldeletion.com, which direct traffic to one specific IP address in Confluence Networks (208.91.197.132).
Though Marketo and its many customers noticed the issues immediately, the damage was done. According to the dig output, the poisoned NS records had very long TTLs of almost two days, so once a DNS server cached them, it would not look up marketo.com’s NS records again for up to two days unless its cache was flushed. This kind of cache poisoning can also spread like wildfire, because resolvers that forward queries to an upstream resolver inherit whatever that upstream has cached. As a result, it may be two full days before the impacts to Marketo’s service completely disappear.
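To make the TTL arithmetic concrete, here is a small sketch of when a cached record would age out. The post only says the TTL was "almost two days", so the value of exactly 172,800 seconds is an assumed upper bound, and the moment of caching is a hypothetical example (a resolver that cached the bad records right when the outage began).

```python
from datetime import datetime, timedelta

# Assumption: exactly two days. The real TTL was "almost two days".
ASSUMED_TTL_SECONDS = 2 * 24 * 60 * 60  # 172800


def cache_expiry(cached_at: datetime, ttl_seconds: int = ASSUMED_TTL_SECONDS) -> datetime:
    """Time at which a record cached at `cached_at` leaves the cache."""
    return cached_at + timedelta(seconds=ttl_seconds)


def still_cached(cached_at: datetime, now: datetime,
                 ttl_seconds: int = ASSUMED_TTL_SECONDS) -> bool:
    """True while the cached record has not yet aged out."""
    return now < cache_expiry(cached_at, ttl_seconds)


# Hypothetical resolver that cached the bad NS records at the outage start
# (all times Pacific, kept naive for simplicity).
cached_at = datetime(2017, 7, 25, 4, 25)
print(cache_expiry(cached_at))                                # 2017-07-27 04:25:00
print(still_cached(cached_at, datetime(2017, 7, 26, 11, 45)))  # True
```

A resolver that cached the bad records later in the day would hold them correspondingly later, which is why recovery trails the fix by up to the full TTL.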
Waiting on TTLs
Once Marketo’s domain was properly renewed and the NS records were updated to reflect Marketo’s actual nameservers (ns1.marketo.com and ns2.mktdns.com), it became a waiting game. Marketo’s team may have reached out to major networks and ISPs to have them flush their DNS caches and speed up recovery, but a significant number of users may simply have to wait for the 2-day TTL on those NS records to expire before they can look up the correct address and access Marketo.
We set up a test to run DNS queries for the A record of app.marketo.com, from roughly 1,700 DNS resolvers across 38 countries and 540 networks. This helped us understand what percentage of vantage points (and from that, the rough percentage of global users) continued to see the bogus Confluence Networks IP address and thus could not access Marketo. Feel free to explore the share link.
When we set up the test around 10:45am Pacific on July 25th (about 6 hours after the outage began), the penetration of the bad IP address (208.91.197.132) was still significant, at roughly 30% of vantage points. It slowly decreased over the next day and stands at around 11% at the time of writing (July 26th, 11:45am Pacific).
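The penetration figure is just the share of vantage points whose answer matches the bad IP. A minimal sketch of that calculation follows, using a made-up sample of ten resolvers rather than the real test data. The resolver names and the healthy address (192.0.2.10, from a range reserved for documentation) are placeholders.

```python
from typing import Mapping

BAD_IP = "208.91.197.132"  # the Confluence Networks address from this post


def affected_share(answers: Mapping[str, str], bad_ip: str = BAD_IP) -> float:
    """Percentage of vantage points whose A-record answer equals `bad_ip`."""
    if not answers:
        return 0.0
    hits = sum(1 for ip in answers.values() if ip == bad_ip)
    return 100.0 * hits / len(answers)


# Hypothetical sample: 3 of 10 resolvers still return the bad address.
sample = {f"resolver-{i}": "192.0.2.10" for i in range(10)}
for name in ("resolver-0", "resolver-1", "resolver-2"):
    sample[name] = BAD_IP

print(affected_share(sample))  # 30.0
```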
The US was among the hardest-hit countries, with 44% of vantage points affected at the start of the test. We can also use this test to monitor which networks have flushed their caches and which continue to serve the bogus DNS records. AT&T, for example, had not yet flushed all of its caches and still had 13 affected vantage points more than a day after the start of the outage.
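Grouping the same observations by network shows which providers still serve the stale records. The sketch below is one way to do that tally, assuming each observation is a (network, answer) pair. The network names and the healthy address are invented for illustration and are not taken from the test data.

```python
from collections import Counter
from typing import Iterable, Tuple

BAD_IP = "208.91.197.132"  # the Confluence Networks address from this post


def affected_by_network(observations: Iterable[Tuple[str, str]],
                        bad_ip: str = BAD_IP) -> Counter:
    """Count vantage points per network that still return `bad_ip`."""
    return Counter(network for network, ip in observations if ip == bad_ip)


# Hypothetical observations: (network name, A-record answer).
observations = [
    ("net-a", BAD_IP),
    ("net-a", BAD_IP),
    ("net-b", "192.0.2.10"),
    ("net-b", BAD_IP),
]

# Networks with the most affected vantage points come first.
print(affected_by_network(observations).most_common())
# [('net-a', 2), ('net-b', 1)]
```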
Lessons Learned
The most obvious takeaway from this outage is to ensure that the responsibility for renewing a domain is clearly allocated to a team or team member, and that that responsibility is properly passed on when a team experiences turnover.
Further, keep in mind that long TTLs can directly increase the recovery time of a DNS-related outage. In this case, Marketo’s team renewed the domain and fixed the DNS records quickly, but due to the long 2-day TTL on the inaccurate records, those bad records remained cached on many DNS servers around the world. As a result, many Marketo users may be completely unable to access Marketo services for as long as two days.
The data presented in this post can be collected through DNS tests, which are available through a free ThousandEyes trial. Sign up to ensure that critical services like DNS function exactly as you expect.