On September 6th, starting at 10:40 am Pacific time, ThousandEyes detected a major disruption of user access to Wikipedia sites from around the world that lasted nine hours. Wikipedia’s German Twitter account reported (translated here from German) that the site was “being paralyzed by a massive and very broad DDoS attack.”
ThousandEyes was monitoring Wikipedia’s main page from Cloud Agent vantage points in 29 cities and 20 countries around the world and was able to capture the onset and progression of the attack as seen from the user, network and Internet routing point of view. Following is an analysis of the event. You can follow along via this ShareLink.
Loss of Availability
ThousandEyes detected a significant drop in HTTP server availability from around the world beginning at 10:40 am, with eleven agents, primarily in Europe, the Middle East and Africa (EMEA), losing access to the site. Throughout the attack, the locus of impact remained in EMEA. The “Status by Phase” legend in the lower left corner of Figure 1 below shows that most of the issues were in the “Connect” phase of the HTTP server test, which means that a user’s computer would not be able to complete the three-way TCP “handshake” needed to establish a connection for ongoing communication with the Wikipedia servers.
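To make the “Connect” phase concrete, here is a minimal sketch of a TCP connection check using Python's standard socket module. The function name can_connect, the default port and the timeout are illustrative assumptions; this is not how ThousandEyes agents are implemented.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP three-way handshake with host:port completes.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host name and then performs the
        # TCP handshake; either step can fail.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, timeouts, refused connections and
        # unreachable networks.
        return False
```

Calling can_connect("www.wikipedia.org") from a vantage point that cannot complete the handshake within the timeout would return False, which is roughly what a “Connect” phase error represents. Note that this simple check does not distinguish a DNS failure from a connect failure.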
Over the course of the attack, user locations in the U.S., Mexico, Argentina and Brazil were also impacted on an intermittent basis, as seen in Figure 2.
At its worst points, user vantage points in India, South Korea, Hong Kong, Malaysia and Australia were also impacted, as seen in Figure 3.
Increased Response Times
While loss of availability was severe, it wasn’t the only impact on users. Even in locations where users could connect to Wikipedia’s servers, HTTP response times (time to first byte delivered to the user’s browser) increased dramatically. Figure 4 shows that at roughly 10 am PDT, before the attack started, Wikipedia had an average HTTP response time around the globe of 353 ms, and we can see that the response time for the Hong Kong monitoring agent was 146 ms. The response time pie graph at the lower left of Figure 4 indicates that in normal operation, connect time is the smallest portion of the response time: 62 ms on average globally vs 88 ms for DNS, 128 ms for SSL negotiation and 63 ms for server wait time. Reading the time series graph at the top of Figure 4, we can see that average response times spiked at multiple points over the nine-hour event.
Fast forward to one of the response time spikes and there is quite a different picture, as seen in Figure 5. At this point in time (6:50 pm PDT), the global average response time has risen to 1.201 seconds, roughly 3.4 times normal. More specifically, response time from Hong Kong has skyrocketed to 2.32 seconds, nearly 16x normal. Of course, when consuming Wikipedia pages, a rise in response time isn’t nearly as damaging as when you’re playing an MMO, but it still makes for a very frustrating user experience.
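The slowdown factors quoted above follow directly from the baseline and spike measurements. A short sketch of the arithmetic, using only the figures given in the text (the helper name slowdown is an illustrative assumption):

```python
def slowdown(baseline_ms: float, observed_ms: float) -> float:
    """Ratio of observed response time to baseline response time."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return observed_ms / baseline_ms


# Figures from the text: global average 353 ms -> 1201 ms,
# Hong Kong agent 146 ms -> 2320 ms.
global_factor = slowdown(353, 1201)
hong_kong_factor = slowdown(146, 2320)

print(f"global: {global_factor:.1f}x, Hong Kong: {hong_kong_factor:.1f}x")
# prints "global: 3.4x, Hong Kong: 15.9x"
```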
Seeing the DDoS Attack at the Network Layer
So far, we’ve looked at availability and performance of the HTTP server layer, but ThousandEyes agents also perform network-layer measurements, and that’s where we actually see the impact of the DDoS attack. Figure 6 shows our Path Visualization view with a graph of packet loss over the duration of the event (we can see up to 60% packet loss across all monitoring agents) and, below it, a topological visualization with monitoring agents on the far left connecting over a variety of Internet networks to the green nodes on the far right, which are labeled with IP addresses and represent Wikipedia hosting sites. 91.198.174.192 is the Wikipedia Netherlands site, 18.104.22.168 is Wikipedia Singapore and 22.214.171.124 is Wikipedia in Virginia, US.
The nodes in the paths between the agents on the left and the destinations on the right are individual router hops. Many are circled in red, with the thickness of the circle indicating the amount of packet loss, which ranges from 17% to 100%. Some of these routers are operated by Wikimedia, but many are operated by upstream ISPs like NTT and Telia. This perimeter of router nodes with high packet loss is common when a site is under DDoS attack, since the volumetric flood of bogus traffic clogs the pipes and “denies service” to legitimate user packets trying to get to the Wikipedia servers, leading to the loss of availability or increased response times seen at the HTTP layer.
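As a rough illustration of how a per-hop loss figure is derived, the sketch below computes loss percentages from probe counts. The hop names and counts are hypothetical placeholders, not data from this event, and the approach is a simplification rather than ThousandEyes' actual measurement method.

```python
def loss_percent(sent: int, received: int) -> float:
    """Percentage of probes that were sent but never answered."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


# Hypothetical probe counts per hop as (sent, received); not measured data.
hops = {
    "hop-a": (50, 50),
    "hop-b": (50, 20),
    "hop-c": (50, 0),
}

# Keep only the hops that dropped at least one probe.
lossy = {
    name: loss_percent(sent, received)
    for name, (sent, received) in hops.items()
    if received < sent
}
print(lossy)
```

In a path view like Figure 6, the hops left in lossy would be the ones circled in red.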
Remediation Steps Ensue, As Seen via BGP
DDoS attacks aren’t uncommon. In fact, a Cisco study predicted that by 2020 there would be seventeen million DDoS attacks annually. What is a little less common is to see the evolution of remediation. As Wikimedia shared on its blog (see Figure 7), it is continuously optimizing its response to a complex threat environment.
In the period immediately following the attack, Wikimedia began inserting Cloudflare in between their Virginia site and the rest of the Internet. We were able to catch this action in flight by looking further down into the BGP routing layer view. In Figure 8, we can see that Wikimedia sites in the US were advertised to the Internet using a large network address (or prefix) 126.96.36.199/22 that contains 1024 individual IP addresses. We can see that Wikimedia’s Internet network (known as an Autonomous System), AS 14907, is peered with seven upstream ISPs including Telia, NTT America, Zayo, and Telstra and 188.8.131.52/22 is reachable via AS paths transiting those networks.
At roughly 8:30 pm PDT, our BGP monitoring agents detected two new, more specific routes, which routers prefer over the covering 184.108.40.206/22 because forwarding follows the longest matching prefix. One prefix was 220.127.116.11/23 and was advertised by Wikimedia, as seen in Figure 9. This prefix is reachable from the Internet via Telia, Zayo and NTT America as upstream ISPs. The other prefix, 18.104.22.168/23, was advertised by Cloudflare, as seen in Figure 10. Each /23 represents 512 individual IP addresses, so in effect, Wikimedia split its large prefix, placed one half behind Cloudflare and kept the other half directly peered to the Internet.
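The prefix arithmetic and the preference for more specific routes can be checked with Python's standard ipaddress module. This is a minimal sketch under stated assumptions: 10.0.0.0/22 is a placeholder from private address space rather than Wikimedia's real prefix, and the routes table and best_route helper are hypothetical.

```python
import ipaddress
from typing import Optional

# Placeholder covering prefix; not Wikimedia's actual address space.
covering = ipaddress.ip_network("10.0.0.0/22")

# Splitting a /22 by one bit yields two /23 networks.
halves = list(covering.subnets(prefixlen_diff=1))

# Hypothetical routing table: prefix -> who advertises it.
routes = {
    covering: "origin network, covering /22",
    halves[0]: "origin network, direct /23",
    halves[1]: "mitigation provider, /23",
}


def best_route(address: str) -> Optional[ipaddress.IPv4Network]:
    """Longest-prefix match: the most specific prefix containing address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None


print(covering.num_addresses)              # 1024
print([h.num_addresses for h in halves])   # [512, 512]
print(routes[best_route("10.0.3.7")])      # mitigation provider, /23
```

Because both /23s are more specific than the /22, traffic for any address in the range follows one of the /23 routes for as long as they are advertised, and the /22 only acts as a fallback.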
In Figure 10 above, we see that there are some changes indicated by the profusion of red lines. The dotted red lines show where routing paths have been withdrawn, and the solid red lines show where new routing paths have been established. It’s a busy visualization, but in essence, all the new paths are going through Cloudflare, indicating that Cloudflare’s large-scale network has been inserted between this portion of Wikimedia’s network and the rest of the Internet. Cloudflare provides CDN and DDoS mitigation, among other services, so the move was presumably meant to increase protection against large-scale attacks. The fact that only a portion of Wikimedia’s address space was cut over could mean that this move was executed in a slightly experimental fashion, possibly as an A/B test of sorts. Nonetheless, by 10 pm PDT, 22.214.171.124/23 was fully front-ended by Cloudflare, as seen in Figure 11.
Wikimedia also started a process of cutting over AS Paths for its 126.96.36.199/24 Netherlands prefix to be reachable via Cloudflare starting at 9:45 pm PDT, and proceeded to incrementally shift paths over the next two days, ending up with the following AS Path picture seen in Figure 12, where most paths go through Cloudflare’s network.
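One way to quantify an incremental cutover like this is to track what share of observed AS paths traverse the mitigation provider. The sketch below is illustrative only: the paths are hypothetical, the monitor-side ASNs come from the range reserved for documentation (64496 to 64511), AS 13335 is Cloudflare's ASN, and AS 14907 is the Wikimedia ASN mentioned earlier.

```python
from typing import List

CLOUDFLARE_ASN = 13335
WIKIMEDIA_ASN = 14907

# Hypothetical AS paths seen by four BGP monitors, nearest AS first.
# These are made-up examples, not paths observed during the event.
paths: List[List[int]] = [
    [64496, CLOUDFLARE_ASN, WIKIMEDIA_ASN],
    [64497, CLOUDFLARE_ASN, WIKIMEDIA_ASN],
    [64498, 64499, WIKIMEDIA_ASN],
    [64500, CLOUDFLARE_ASN, WIKIMEDIA_ASN],
]


def share_via(asn: int, as_paths: List[List[int]]) -> float:
    """Fraction of AS paths that include the given ASN."""
    if not as_paths:
        raise ValueError("as_paths must not be empty")
    return sum(asn in path for path in as_paths) / len(as_paths)


print(share_via(CLOUDFLARE_ASN, paths))  # 0.75
```

Sampling this fraction over time would show the kind of gradual shift toward Cloudflare that Figure 12 depicts at its end state.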
What Should Enterprise Teams Take Away?
As mentioned previously, DDoS attacks are a sad fact of life in doing digital business. If Wikipedia, one of the largest media sites in the world, can be impacted, your business can be too. Clearly, taking proactive remediation steps to be prepared makes an awful lot of sense if, unlike non-profit Wikimedia, your digital business is built for the purpose of generating revenue.
But the lesson here goes beyond preparing for DDoS. No matter what the disruption, customers, partners and users are unforgiving of anything that stands between them and their digital goals. Even if you have DDoS mitigation in place, knowing how your user experience is being delivered is mission critical, and you need to see that from every vantage point that matters to your business. ThousandEyes offers you those vantage points and multi-layer visibility from synthetic transactions, to HTTP server availability and performance, to network paths, to Internet routing and even collective intelligence-based Internet outage detection in one view, so you can manage toward success in the face of all the unpredictable things that can go bump in the night on the Internet.
Finally, remember that this lesson doesn’t stop at web or other digital properties that your business owns. Gartner studies show that large enterprises now have dozens of critical cloud and SaaS business partners that are intrinsic to their business. Take a SaaS like Salesforce, where a manufacturer like Schneider Electric has 45,000 employees and hundreds of thousands of external customers and partners doing business across that platform. Knowing how app experience is delivered in normal times gives you a helpful baseline, so that when issues occur you can understand not only how performance has changed but also where the problem is happening: in your network or across the Internet, and in whose network or cloud domain. In addition, you can better understand what sort of problem it is (DNS, connect, SSL, or wait times, as seen in Figure 1). Without that knowledge, you can’t solve problems, communicate to employees and customers, or escalate to the right internal team or external provider. In short, without this sort of multi-layered visibility from all your important internal and external locations, you’re flying blind, and that’s no way to operate in a cloud-based or cloud-first IT model.
ThousandEyes is a leading authority on Internet outages, so we offer these outage reports regularly to help educate the industry and our enterprise customers. If you’d like to continue getting this sort of insight, subscribe to our blog.
In the case of this Wikipedia attack, we’re offering a webinar to walk through our findings, so register now and share the registration with your colleagues to increase your level of awareness and education on how to deal with Internet and digital experience disruptions, so you can protect your business.