Update: This blog was originally posted on Tuesday at 10PM. We have updated the blog to reflect new information.
On Tuesday, October 8, 2019, utility company PG&E announced that it would shut off power to some 800,000 customers in Northern and Central California in an effort to prevent wildfires in certain parts of the state. It directed California residents to visit its website to see how the planned power outages might affect them. Unfortunately, chaos ensued as the PG&E website buckled under the sudden surge of traffic, which appears to have triggered a DDoS mitigation event that went awry, blocking legitimate users from accessing its site. In this blog post, we’ll recap the timeline of events as we saw it through the ThousandEyes platform.
Feel free to explore the outage with this interactive sharelink.
Starting early Tuesday morning, consumers from California and across the United States were unable to reliably reach the PG&E website. At the time of publishing, the issue is still ongoing. While the primary cause of the outage appears related to congestion in the PG&E data center, it also appears that a misconfiguration of PG&E's DDoS mitigation service (Sucuri) is resulting in users seeing HTTPS (SSL) certificate errors.
At the web application layer, ThousandEyes observed widespread and consistent availability issues where attempts to load PG&E's web page would result in either connection timeouts or SSL certificate errors (Figure 1).
At the network layer, ThousandEyes saw network traffic targeting PG&E's website cycling between PG&E's data center and what appears to be PG&E's DDoS mitigation provider (Sucuri). Traffic targeting PG&E's network (AS 2013) experienced significant packet loss, whereas traffic targeting Sucuri did not. Looking at one of ThousandEyes' vantage points in San Francisco, we can clearly see the oscillation pattern (Figures 2 and 3 below), which appears to occur roughly every 30 minutes.
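To make the "roughly every 30 minutes" observation concrete, here is a minimal sketch of how the interval between target switches could be estimated from a series of path measurements. The sample data, the 5-minute measurement cadence, and the function names are all hypothetical and chosen for illustration; they are not an export from the ThousandEyes platform.

```python
from statistics import mean


def switch_times(samples):
    """Return the timestamps at which the observed target changes.

    samples: list of (minute, target) tuples, sorted by minute.
    """
    switches = []
    for (_, prev_target), (minute, target) in zip(samples, samples[1:]):
        if target != prev_target:
            switches.append(minute)
    return switches


def mean_switch_interval(samples):
    """Average number of minutes between consecutive target changes."""
    times = switch_times(samples)
    if len(times) < 2:
        return None  # not enough switches to measure an interval
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return mean(gaps)


# Invented samples: one measurement every 5 minutes over 3 hours,
# alternating between the two targets in 30-minute blocks.
samples = [
    (m, "AS 2013" if (m // 30) % 2 == 0 else "Sucuri")
    for m in range(0, 180, 5)
]

print(switch_times(samples))
print(mean_switch_interval(samples))
```

With real measurements the gaps would not be perfectly uniform, so the spread of the gaps is worth examining alongside the mean.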
To understand what's going on a bit more, we can correlate the behavior we're seeing at the network layer with what we're measuring at the web application layer. When we select a timeframe where the traffic is targeting PG&E's network (AS 2013) and network loss is high (e.g. 12:55 pm), and then jump to that same timeframe for the HTTP server view, we see that we're getting HTTP connect errors. This would be expected when there is packet loss. An end user would just experience this as a spinning browser window or timeout.
However, when we select a timeframe when traffic is targeting Sucuri (e.g. 12:45 pm) and jump to the same timeframe in the HTTP server layer, we see that the web connection is receiving an SSL error, "SSL: no alternative certificate subject name matches target host name 'www.pge.com'", as seen in Figure 4.
A customer attempting to browse PG&E's website at this moment would have seen an SSL warning in their browser stating that the SSL security certificate could not be verified, as seen in Figure 5.
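The two failure modes described above (connect errors while traffic targets PG&E's network, certificate errors while it targets Sucuri) can be told apart with a simple probe. The following is a rough sketch using only Python's standard library; the function names and category labels are our own, and it checks only the TCP connection and TLS handshake rather than a full HTTP page load.

```python
import socket
import ssl


def classify_failure(exc):
    """Map an exception raised during a probe to a coarse category."""
    # Check the most specific SSL error first: it is also an OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "ssl_certificate_error"
    if isinstance(exc, ssl.SSLError):
        return "ssl_error"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "connect_timeout"
    if isinstance(exc, OSError):
        return "connect_error"
    return "unknown"


def probe(hostname, port=443, timeout=5.0):
    """Attempt a TCP connection and TLS handshake; return 'ok' or a category."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return "ok"
    except Exception as exc:
        return classify_failure(exc)
```

Calling `probe("www.pge.com")` at regular intervals and recording the result next to the network path data is one way to reproduce the kind of correlation described here.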
Viewing the certificate from the browser would confirm that the DNS Names field did not contain "www.pge.com" or "pge.com". Instead, it referenced only the wildcard *.sucuri.net, as seen in Figure 6.
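To illustrate why the browser rejects this certificate, here is a simplified sketch of the hostname check a TLS client performs against a certificate's DNS names. It only handles a wildcard that stands in for the entire leftmost label, which is a deliberate simplification of the real matching rules, and the function names are our own.

```python
def hostname_matches(pattern, hostname):
    """Return True if a certificate DNS name covers the given hostname.

    Simplified: a leading "*." matches exactly one leftmost label.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".sucuri.net"
    if not hostname.endswith(suffix):
        return False
    leftmost = hostname[: -len(suffix)]
    # The wildcard must cover one non-empty label with no dots in it.
    return leftmost != "" and "." not in leftmost


def certificate_covers(dns_names, hostname):
    """Check a hostname against every DNS name listed in a certificate."""
    return any(hostname_matches(name, hostname) for name in dns_names)


dns_names = ["*.sucuri.net"]  # the entry described above
print(certificate_covers(dns_names, "www.pge.com"))
print(certificate_covers(dns_names, "pge.com"))
```

Both checks fail because neither hostname ends in ".sucuri.net", which is the mismatch the SSL error message reports.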
What’s Behind this Cycle of Misery?
It certainly is odd that a mitigation provider wouldn’t have a proper SSL certificate in place for a client website. So, we dug a bit further. Based on the network path view shown in Figure 3, we know that when "pge.com" traffic is being redirected to Sucuri, traffic is targeted to the following IP address: 188.8.131.52. We tried connecting directly to this site using HTTP (rather than HTTPS, so no SSL certificate will be required), and that resulted in being served a firewall configuration error page, as seen in Figure 7.
What could be going on here? (Warning: thoughtful conjecture to follow).
We know that there was a total loss of connectivity to the PG&E website, depicted through the severe packet loss we saw in Figure 2. We suspect this is related to the massive increase in traffic related to the planned PG&E outages on Oct 9th and 10th. News announcements urged customers to check the PG&E website to see if they would be affected, but unfortunately it seems PG&E was unable to handle the sudden surge of traffic.
The cycle we observed at the network layer suggests that, as PG&E's data center network started dropping more and more packets, traffic was redirected to Sucuri. As mentioned, we suspect that Sucuri is PG&E's DDoS mitigation service.
One possibility is that the increase in traffic was mistaken for what transpires during a DDoS attack. Most DDoS services can be configured to use more sophisticated means than simple rate limiting to differentiate between normal high traffic and DDoS attacks. So this scenario would likely have been caused by a misconfiguration redirecting traffic from PG&E to Sucuri.
Another possibility is that the Sucuri mitigation policy was behaving as expected, and that as connection errors to the PG&E website increased, the Sucuri DDoS mitigation service was deployed on-demand, in an attempt to serve up a functioning PG&E website.
In either case, it's clear that the Sucuri service was not prepared to handle PG&E web traffic. At a minimum, it was not configured with the proper SSL certificate to serve PG&E sites. The "Sucuri Website Firewall - Not Configured" page is also strong evidence that the PG&E service deployment was not ready for prime time.
Furthermore, the nearly regular 30-minute intervals at which ThousandEyes shows traffic switching between PG&E and the Sucuri mitigation service hint that there may be a combination of misconfiguration and automation at play here. If the Sucuri service is being triggered despite its incomplete configuration, it may also be falling back on a default behavior (again signalling incomplete configuration) that is automatically redirecting traffic back to the PG&E origin site, setting up a kind of feedback loop.
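As a purely illustrative toy model of this conjecture, the sketch below simulates an automated failover that sends traffic to a mitigation service after the origin has been failing for a while, and a default behavior that sends it back after a fixed hold time. The two timer values are invented to reproduce a switch roughly every 30 minutes; we have no knowledge of how PG&E's or Sucuri's automation is actually configured.

```python
def simulate(total_minutes, detect_minutes=30, fallback_minutes=30):
    """Return a list of (minute, target) for a toy failover feedback loop.

    Assumptions (hypothetical): the origin is overloaded the whole time,
    so failover always triggers after detect_minutes; the mitigation
    service is unconfigured, so it always falls back after fallback_minutes.
    """
    timeline = []
    target = "origin"
    minutes_in_state = 0
    for minute in range(total_minutes):
        timeline.append((minute, target))
        minutes_in_state += 1
        limit = detect_minutes if target == "origin" else fallback_minutes
        if minutes_in_state >= limit:
            # Flip to the other target and restart the timer.
            target = "mitigation" if target == "origin" else "origin"
            minutes_in_state = 0
    return timeline


timeline = simulate(120)
changes = [m for (m, t), (_, prev) in zip(timeline[1:], timeline) if t != prev]
print(changes)
```

Because neither side ever resolves the underlying condition in this model, the loop never settles, which is consistent with the regular oscillation seen in the path data but does not prove this is the mechanism.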
Update: Wed, Oct 9, 2019 10:30AM
As the outage continued throughout the day Tuesday, PG&E appears to have introduced another mitigation service Tuesday evening. As you can see in Figure 8 below, at around 8:50 PM (PST) traffic to pge.com was directed to Defense.Net, which is a DDoS mitigation and protection service offered by F5.
Looking at the application layer for the period of time immediately following the switch to F5, we see that the PG&E website is still nearly completely inaccessible (Figure 9).
However, around 1AM PST on Wednesday morning (about 4 hours after activating the F5 DDoS service) things started to improve, and by 1:20 AM PST, the pge.com website had nearly 100% availability and was accessible from ThousandEyes vantage points. We can see this in the page load geographic map in Figure 10 below.
At the same time, ThousandEyes' application layer transaction test from our San Francisco vantage point confirms PG&E’s website is accessible (Figure 11). We also see from the waterfall view that during this time the PG&E site is being served by F5’s mitigation service (Defense.Net).
This four-hour delay could be because it took significant time after activating the F5 service to get it properly configured and/or migrated. It could also be that the reduction in traffic to PG&E's website around 1AM was enough to no longer trigger the conditions leading to the outage and DDoS mitigation cutover.
So, was that the fix? Unfortunately, it does not appear so. Around 6:15AM PST it looks as if things are back to square one. The PG&E website once again appears completely inaccessible, with just about all connection attempts resulting in either HTTP connection or SSL errors (Figures 12 and 13).
Furthermore, looking at the network layer visibility, we see the same pattern from yesterday's outage: massive network-level packet loss accessing PG&E’s website and a regular oscillation of traffic between PG&E data center locations and the Sucuri-based mitigation service. We no longer see PG&E's website being served by F5's DDoS service. And unfortunately, as of 11 AM, we still see the firewall misconfiguration message when accessing the IP address of PG&E's mitigation service provided by Sucuri.
While the circumstances surrounding this outage seem to be a series of unfortunate events, it clearly points to the complexity of today's modern websites. With dependencies on external services and the inherent unpredictability of the Internet, we're reminded that you need to have the right visibility in place to make sure that when the lights go out, you don't get stuck in the dark.