
The Internet Report

Outages at X, google.com, and jsDelivr + Why Details Matter

By Mike Hicks
27 min read
Internet Report on Apple Podcasts Internet Report on Spotify Internet Report on SoundCloud

Summary

Explore google.com, X, and jsDelivr outages; plus, learn why every component in your service delivery chain matters for assuring great digital experiences.


This is the Internet Report, where we analyze outages and trends across the Internet, from the previous two weeks, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.



Internet Outages & Trends

In life, we’re often told not to “sweat the small stuff”—to shake off the minor things so we can remain focused on the “bigger picture.” But ITOps does not mirror real life: the small stuff is often the most critical. A change to the smallest detail that’s either missed or unknown can quickly propagate through to affect the “big picture.” When that happens, pinpointing the “small thing” and correcting it (or rolling back a change made to it) can be the difference between a brief outage and a prolonged one.

This flow-through—where small things have big impacts—occurs with regular frequency, to organizations large and small. The larger the company, the more expansive and complex its systems, and the more moving parts—dependencies and interdependencies—to maintain visibility of.

A recent outage at CDN service operator jsDelivr reinforced the importance of keeping track of all the little things and watching every piece of your service delivery chain to maintain availability and resiliency.

In jsDelivr’s case, the detail was an expired certificate, which created issues serving content and impacted many websites that rely on the CDN service. jsDelivr said the occurrence was “completely unexpected and something we weren't prepared for,” breaking a system that had “worked great for almost 10 years.”

In recent weeks, both X (formerly Twitter) and google.com also experienced outages. In both cases, it appears that a problem in the backend systems disrupted the connection between the frontend and the backend services. It is not entirely clear what caused the backend issue, but any impact on users was likely unintended. These outages serve as a reminder that even a minor change or disruption in the backend can have unexpected effects on other parts of the infrastructure, causing them to respond in unanticipated ways.

Read on to learn about all the outages and degradations.


jsDelivr Outage

On May 2, users of the open-source CDN jsDelivr in Asia, Africa, and parts of Europe were affected by an outage that lasted more than five hours and stemmed from a single detail that went awry.

jsDelivr’s architecture is described as a “unique multi-CDN infrastructure built on top of CDN networks provided by Cloudflare, Fastly, Bunny, and GCore.” The company also uses “custom servers in locations where CDNs have little or no presence.”

According to jsDelivr’s detailed post-incident report, the root cause of the outage was a switch by one CDN network, Cloudflare, from DigiCert's certificate authority to Google Trust Services. This switch changed the domain validation method, and the new method conflicted with the way jsDelivr routes traffic between its main CDN providers.

As a multi-CDN, jsDelivr routes traffic between providers based on internal rules. Because it could not use Cloudflare's DNS hosting, jsDelivr had an unusual setup in which Cloudflare acted only as a CDN while DNS was hosted elsewhere. To allow Cloudflare to manage certificates, jsDelivr added the required DNS validation records at its third-party DNS providers. This system had worked for almost 10 years, until the migration of certificate authorities made those validation records obsolete and Cloudflare switched to HTTP validation instead.

The switch to HTTP validation caused a problem as it wasn’t compatible with jsDelivr's approach. Depending on where the verification test came from, it could hit a different CDN provider and fail.
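The mismatch between HTTP validation and multi-CDN routing can be illustrated with a small sketch. The provider names come from the article, but the routing table, the region names, and the idea that only one provider holds the validation token are simplifying assumptions for illustration, not jsDelivr's actual rules.

```python
# Hypothetical sketch: HTTP-based domain validation behind a multi-CDN.
# The routing table below is invented for illustration only.

# Which CDN answers requests arriving from each region (assumed rules).
ROUTING = {
    "north-america": "Fastly",
    "europe": "Cloudflare",
    "asia": "Cloudflare",
    "south-america": "Bunny",
}

# Assumption: only the CDN requesting the certificate can serve the token.
TOKEN_HOLDER = "Cloudflare"


def validation_succeeds(check_region: str) -> bool:
    """True if a validation request from this region reaches the token holder."""
    return ROUTING[check_region] == TOKEN_HOLDER


def failing_regions() -> list[str]:
    """Regions from which an HTTP validation check would hit another CDN."""
    return sorted(region for region in ROUTING if not validation_succeeds(region))


if __name__ == "__main__":
    print(failing_regions())
```

Under these assumptions, a validation check only succeeds when it happens to originate somewhere that is routed to the provider requesting the certificate, which is why the outcome depends on where the check comes from.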

Unfortunately, jsDelivr was not aware of this change. When the previous DigiCert certificate expired, Cloudflare tried to issue a new certificate using HTTP validation, which failed. As a result, Cloudflare reverted to an old certificate from 2020 that had already expired. This resulted in an error message for all the users who were hitting Cloudflare's CDN based on jsDelivr routing.
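One way to keep an eye on this kind of detail is to track how many days remain before a served certificate expires. The following is a minimal sketch using only the Python standard library; the function names and the 10-second timeout are illustrative choices, not part of any tooling described in the post-incident report.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(hostname: str, port: int = 443) -> str:
    """Fetch the notAfter field of the certificate served by `hostname`.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

In a multi-CDN setup, a check like this would need to run from several locations, since each location may be served by a different provider with a different certificate.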

jsDelivr said in its post-incident report that it will make a series of changes following the incident. This includes the way it handles “any critical changes by CDN providers,” which will now “immediately result in their deactivation from jsDelivr and manual verification after the fact to ensure the CDN’s stability with our specific setup.”

X Outage

ThousandEyes detected an outage affecting X (formerly Twitter) at around 7:15 AM (UTC) on April 29 that prevented some users from interacting with the service. Requests to the application kept timing out, indicating an issue with the service’s backend. The service outage was resolved at approximately 8:30 AM (UTC). 

Screenshot showing multiple regions impacted during the X outage
Figure 1. The ThousandEyes platform showed the X outage impacting multiple regions globally

A gradual accumulation of affected servers suggests a mounting burden on the system. As the outage began to affect some interactions with X services, we noticed a corresponding rise in page load time, which coincides with the apparent increase in impacted servers.

Screenshot showing increased page load times
Figure 2. ThousandEyes saw X users experience increased page load times during the outage

No networks were harmed during the making of this disruption, indicating that the issue lay with backend systems. Users could reach the service, but it wasn’t responding; everything on the page would have appeared functional but static. Manually refreshing the feed would have found no new posts, and attempts to write new posts initially appeared to work, but then timed out.

Screenshot showing network paths appearing normal
Figure 3. No network issues appeared to occur at the same time as the outage

In situations like this, we would usually expect to see an error message such as HTTP 503 Service Unavailable or 502 Bad Gateway. Sometimes, as we have seen with X in the past, a 302 (or 307) redirect to a static error message (such as X’s classic ice cream-themed error page) may indicate that there is an issue. However, in this particular case, we observed that for some connections, the redirection simply timed out. Based on the data available on the ThousandEyes platform, it appeared that the redirection continued to attempt to retrieve and request content as normal. In other words, it did not appear to redirect to the static error message page.
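The distinction drawn above, between an explicit error code, a redirect to an error page, and a plain timeout, can be sketched as a small classifier. This is an illustrative simplification, not how the ThousandEyes platform categorizes results; the function name and category labels are assumptions.

```python
from typing import Optional


def classify_probe(status: Optional[int], timed_out: bool = False) -> str:
    """Map a single HTTP probe result to a coarse failure category."""
    if timed_out or status is None:
        # Request was sent but never completed: no status code to inspect.
        return "timeout"
    if status in (502, 503):
        # The service (or a gateway in front of it) reported a failure.
        return "explicit server-side error"
    if status in (302, 307):
        # May point to a static error page; the target needs checking.
        return "redirect"
    if 200 <= status < 300:
        return "ok"
    return "other"


if __name__ == "__main__":
    for result in [(200, False), (503, False), (302, False), (None, True)]:
        print(result, "->", classify_probe(*result))
```

A timeout carries less information than a 502 or 503, because the service never says what went wrong, which is part of why outages like this one are harder to diagnose from the outside.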

Screenshot showing timeout errors during the outage
Figure 4. During the incident, ThousandEyes observed the page help.twitter.com timing out

During the outage, ThousandEyes noticed that the redirect, which appeared intermittently, seemed to connect but failed several times when attempting to activate or send details to the backend. This might indicate that there was no path to the backend, so requests timed out and were canceled, rather than the backend being reported as unavailable with an associated status code. ThousandEyes often saw the failure occur on the activation object, which is the object used to hold the properties describing the environment and scope of an executing function. This object stores the function's arguments, and the function needs it in order to run. (To see more, explore this outage further in the ThousandEyes platform.)

Screenshot showing intermittent errors
Figure 5. Intermittent connectivity attempts that time out before completion were observed during the outage 

After the service was restored, there appeared to be a noticeable increase in the time it took for the page to load. However, unlike the timeouts observed during the outage, it did seem like the page was fully rendered.

Screenshot showing elevated page load times after services were restored
Figure 6. Page load time appeared elevated after X services were restored, according to ThousandEyes data

While X has not yet issued any formal explanation, what ThousandEyes was able to observe points to a problem with the backend system. The timeouts suggest that the cause was likely connectivity issues between components rather than failed or unavailable services.

google.com Outage

At approximately 2:20 PM (UTC) on May 2, google.com experienced a global disruption, resulting in many users receiving HTTP 502 Bad Gateway errors. 

Screenshot showing global impacts of the google.com outage
Figure 7. Interaction with google.com impacted globally

The normal search process involves visiting the search page, submitting a search query, and receiving the results. During the outage, users appeared to be able to reach the search page but encountered errors when attempting to do anything beyond that. The incident was resolved around 3:15 PM (UTC). (Explore this outage further in the ThousandEyes platform.)

Screenshot showing timeout issues
Figure 8. ThousandEyes observed the page timing out during the incident

Screenshot showing an error message that a user would have seen
Figure 9. As the system “searches” timed out, users received an error message

The 502 Bad Gateway message indicates a technical server-side issue. It signifies that one server has received an invalid response from another. In simpler terms, when you encounter this message, it means you've connected with an intermediate device, like an edge server, that is responsible for retrieving all the necessary data to load the page. However, a glitch occurred in this process, leading to the appearance of the 502 Bad Gateway message.

Screenshot showing errors in response to search requests
Figure 10. HTTP 502 Bad Gateway messages observed in response to search requests

According to the Internet Engineering Task Force (IETF), the 502 status code suggests a few things. Firstly, you are working with a gateway or proxy server if you get a bad gateway notification. Secondly, the proxy attempted to communicate with the origin server. Lastly, the proxy received an invalid response from the server.
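The three points above can be sketched as a simplified gateway that decides which status to return for a single upstream call. This is a hedged illustration only: `fetch_upstream` is a hypothetical callable, and real proxies handle many more cases.

```python
from http import HTTPStatus
from typing import Callable


def proxy_status(fetch_upstream: Callable[[], int]) -> HTTPStatus:
    """Status a simplified gateway returns for one upstream call.

    `fetch_upstream` is a hypothetical callable that returns the upstream
    status code, raises TimeoutError when no response arrives in time,
    or raises ValueError when the response cannot be parsed.
    """
    try:
        upstream_code = fetch_upstream()
    except TimeoutError:
        return HTTPStatus.GATEWAY_TIMEOUT  # 504: no timely response
    except ValueError:
        return HTTPStatus.BAD_GATEWAY  # 502: response was malformed
    try:
        return HTTPStatus(upstream_code)  # pass a valid status through
    except ValueError:
        return HTTPStatus.BAD_GATEWAY  # 502: not a recognized status code
```

The sketch also shows the difference between a 502 and a 504: in the first case the gateway heard back from the upstream server but could not use the answer, and in the second it heard nothing in time.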

502 Bad Gateway errors often stem from domain name resolution, where the computer performs a quick lookup to resolve a domain name to a numeric IP address. If that lookup fails, such as when a site is switching to a new hosting service, a bad gateway warning appears. Another cause could be too many visitors overwhelming a server, making requests impossible to resolve. Additionally, policy or security problems can hinder proxy communication.

It seems unlikely that the site's load caused the issue, as there didn’t appear to be any events that could have triggered an abnormal search load. Interestingly, while the 502 bad gateway error appeared gradually at the HTTP level, it was more of a "lights on/lights off" scenario when looked at transactionally. This could be indicative of a problem with backend name resolution or something related to policy/security verification that prevented the search from taking place (also known as suppression).

Screenshot showing issues with backend service
Figure 11. Interaction with backend services appeared to experience a sharp stop and start pattern

One important item to note is that the outage only seemed to affect google.com and its search function. Other services that rely on the search engine's functions appeared to be unaffected. This suggests that the problem was with the connectivity or linkage between the webpage and the search engine, rather than an issue with the search engine itself.

At the time of this blog’s writing, Google has not provided an official explanation for the issue, nor does it show up in the search status incident history.


By the Numbers

In addition to the outages highlighted above, let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (April 22 - May 5):

  • The number of global outages increased for much of April, but started decreasing steadily in the month’s final few weeks and the beginning of May. From April 22 - 28, there was an 8% reduction in outages compared to the previous week, with outages dropping from 170 to 156. This trend continued in the following week (April 29 - May 5), with the number of outages decreasing by 3%.

  • The United States followed a similar pattern, experiencing a 36% decrease in outages from April 22 to 28, followed by a 7% decrease in the week following (April 29 - May 5).

  • From April 22 to May 5, only 34% of network outages occurred in the United States, breaking the trend of at least 40% of all observed outages being U.S.-centric.

  • Looking at the month-over-month data, outage numbers slightly increased in April compared to March. In April, there were 687 global outages, which is a 1% increase from March’s 678 outages. Similarly, the number of outages in the United States increased by 3% from 289 to 299.

Bar chart showing global and US outage trends over the past eight weeks
Figure 12. Global and U.S. network outage trends over the past eight weeks
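The percentage changes quoted in the bullets above follow from the raw outage counts. As a quick worked check, using only the counts stated in this report:

```python
def pct_change(previous: int, current: int) -> float:
    """Percentage change from `previous` to `current` (negative = decrease)."""
    return (current - previous) / previous * 100


if __name__ == "__main__":
    # Global outages, week of April 22 - 28 versus the previous week.
    print(round(pct_change(170, 156)))  # -8
    # Global outages, April versus March.
    print(round(pct_change(678, 687)))  # 1
    # U.S. outages, April versus March.
    print(round(pct_change(289, 299)))  # 3
```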
