This is the Internet Report, where we analyze outages and trends across the Internet, from the previous two weeks, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. This week, we’re also featuring a conversation exploring the EU’s Digital Operational Resilience Act (DORA) with special guest Bernie Clairmont, Product Solutions Architect at ThousandEyes. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.
Internet Outages & Trends
The financial services sector worldwide is undergoing transformation to meet increasing customer demand for digitized forms of engagement. The challenge for financial institutions around the globe is maintaining service performance and resilience as they do. In a world of digital payments, there is rarely a good time for scheduled maintenance, let alone an unscheduled outage. Yet banks and other financial institutions are repeatedly being tested.
In recent weeks, two banks (BMO and Scotiabank) had customer-facing issues. Earlier in October, customers of Australia’s Westpac reportedly experienced days of issues. We’ve also previously seen multiple banks' services impacted by problems affecting a critical piece of shared infrastructure.
The architecture of banking systems and payment rails is firmly on the radar of governments and regulators. One key example is the European Union’s Digital Operational Resilience Act, known as DORA, which is coming into effect on January 17, 2025. It aims to strengthen “the IT security of financial entities such as banks, insurance companies and investment firms.”
Our podcast this week discusses what DORA adherence means for organizations inside and outside of the EU, steps that companies can take to prepare, and more.
As usual, we’ll also unpack recent outages from the last few weeks, examining the DDoS attacks against the Internet Archive as well as some power and cooling problems at Google Cloud. We’ll also revisit a previous Azure outage, which may lead to changes in how the cloud provider shares updates about service disruptions.
Read on to learn more, or use the links below to jump to the sections that most interest you:
DORA: What ITOps Teams Need To Know
The Digital Operational Resilience Act (DORA) goes into effect on January 17, 2025, and financial institutions serving the EU will need to meet an enhanced set of requirements related to risk management, network resilience, and incident reporting.
While DORA is directly applicable to EU financial institutions, it prompts important discussions about resilience and digital experience assurance that are relevant to all IT operations teams, regardless of industry or region.
DORA establishes a best practice we often discuss on The Internet Report: the importance of taking responsibility for your entire service delivery chain, including the parts you don’t directly control. Under DORA, financial institutions are required to consistently monitor both their own ICT infrastructure and that of their third-party partners.
Tune in to the podcast to hear The Internet Report team and special guest Bernie Clairmont, Product Solutions Architect at ThousandEyes, dive deeper into the following:
- The impact that DORA will have on EU financial institutions and why ITOps teams around the globe, regardless of industry, should be aware of these enhanced requirements
- What EU financial institutions can do to prepare for DORA, and the ongoing steps they’ll need to take once it goes into effect
- Why it’s so critical to have backup systems in place and regularly test them to make sure they’re ready to go if an outage happens
- The need for a deep understanding and comprehensive visibility of the full service delivery chain, including the third-party providers that it relies on
- Why ITOps teams need a full view into their traffic in flight, understanding where their data is at all times and keeping in mind data sovereignty and other important considerations
Learn how the ThousandEyes platform helps financial institutions observe, comprehend, and take action to assist with DORA compliance. Read the white paper today!
BMO Disruption
BMO (Bank of Montreal) customers were unable to log into online banking after a glitch on October 23. The bank acknowledged the problems in social media posts, noting that its app, branch, and ATM networks were unaffected.
The issue manifested for users as HTTP 500 Internal Server Error messages, with requests to access the online banking platform timing out. Tests conducted by ThousandEyes confirmed these timeouts when attempting to access online banking services during the disruption window.
These problems suggest that one or more web servers in the cluster supporting the online banking platform, or possibly the load balancer in front of the cluster, experienced difficulties.
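To illustrate what those symptoms look like from the outside, here’s a minimal sketch of the kind of synthetic check that separates a server-side error from a timeout. The URL is a placeholder, not BMO’s actual endpoint, and the script is a simplified, single-vantage-point stand-in for the tests described above.

```python
import socket
import urllib.error
import urllib.request

# Placeholder URL for illustration only; not BMO's actual endpoint.
URL = "https://onlinebanking.example.com/login"

def probe(url: str, timeout: float = 10.0) -> str:
    """Run one synthetic check and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"healthy: HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # A 5xx response means the front end answered, but the web/app tier
        # (or the load balancer in front of it) could not serve the request.
        if 500 <= err.code < 600:
            return f"server error: HTTP {err.code}"
        return f"client error: HTTP {err.code}"
    except urllib.error.URLError as err:
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout: no response before the deadline"
        return f"connection failed: {err.reason}"
    except TimeoutError:
        return "timeout: no response before the deadline"

if __name__ == "__main__":
    print(probe(URL))
```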
The reported duration of the disruption was 2.5 hours. BMO previously experienced an issue with its online banking platform back in May, which was attributed to a false alarm triggered in a data center.
Internet Archive Outage
The Internet Archive, including the Wayback Machine, was knocked offline by days of distributed denial-of-service (DDoS) attacks in early October. It’s not the first time the nonprofit digital library has been targeted in this way, but the October DDoS was noteworthy in part because it was one of three cyberattacks against the service and its infrastructure in just one month.
Regarding the DDoS specifically, founder Brewster Kahle’s social media posts over several days revealed the persistence of the traffic flood. At least three distinct IP addresses associated with archive.org were targeted. According to a company statement, the archive took its services down following the attacks to perform upgrades, and brought them back up gradually over the following days.
ThousandEyes observed a significant increase in the loss rate at the last hop in the network path, the final segment connecting the end user to the Internet Archive. A spike like this can indicate a DDoS attack in progress.
The high packet loss rate was followed by a reduction in page load times, accompanied by HTTP 503 Service Unavailable errors. This is consistent with some components of the webpages becoming unavailable, a telltale sign that the backend infrastructure may have been taken completely offline.
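For readers who want a feel for these two signals, the sketch below approximates them with off-the-shelf tools: end-to-end packet loss measured with the system ping (Linux/macOS flag syntax assumed) and the HTTP status returned by the site’s front door. It’s a rough, single-vantage-point stand-in for hop-by-hop path and page-load measurements; the hostnames are used purely as examples.

```python
import re
import subprocess
import urllib.error
import urllib.request

# Illustrative targets; archive.org is used here only as an example hostname.
HOST = "archive.org"
URL = "https://archive.org/"

def packet_loss(host: str, count: int = 20) -> float:
    """Rough end-to-end packet-loss estimate using the system ping."""
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else float("nan")

def http_status(url: str) -> int:
    """Return the HTTP status code, including 5xx errors such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

if __name__ == "__main__":
    print(f"packet loss: {packet_loss(HOST)}%")
    print(f"HTTP status: {http_status(URL)}")
```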
Google Cloud Outage
Google Cloud’s europe-west3-c zone was disrupted for over 7.5 hours starting in the early evening on October 23 (PDT). The problems impacted a number of services, including Compute Engine, Cloud Pub/Sub, Dataflow, and Google Kubernetes Engine (GKE).
The post-incident report pinpoints the root cause as a power failure in a single data center within the europe-west3 region. “This failure degraded the building’s cooling infrastructure, leading to a partial shutdown of the europe-west3-c zone to avoid thermal damage and causing Virtual Machines (VMs) to go offline,” Google Cloud said.
The outage pattern is a familiar one in power-loss scenarios: the loss of mains power puts pressure on cooling equipment, which in turn can cause ambient temperatures to rise in certain rooms or halls, leading servers and other infrastructure in the space to be powered down, sometimes gracefully and sometimes not. Ungraceful shutdowns can cause further issues, including damage to hardware or corruption of the data stored on it.
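As a generic illustration of what “graceful” means in practice (not a description of how Google Cloud’s infrastructure behaves), the sketch below shows a service catching a termination signal so it can stop accepting work and flush state before power is cut. The handler and flush step are placeholders.

```python
import signal
import sys
import time

# Minimal sketch of a graceful shutdown path: when the platform signals an
# impending power-down (e.g., SIGTERM from a host agent or orchestrator),
# the service stops taking new work and flushes state before exiting.

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def flush_state():
    # Placeholder for real work: commit in-flight transactions, fsync files,
    # close database connections, deregister from the load balancer.
    print("flushing in-memory state to durable storage")

def main():
    while not shutting_down:
        # ... serve requests ...
        time.sleep(1)
    flush_state()
    sys.exit(0)

if __name__ == "__main__":
    main()
```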
Google Cloud is taking several steps to guard against a repeat of the October 23 incident, including digging deeper into the cause of the “electrical arc flash” that led to the power failure and ensuring that other data centers do not have similar problems. Google Cloud is also “further hardening GCP’s Persistent Disk services to prevent any regional impact during single-zone issues.”
Incident Retrospective: Azure Virtual Desktop Outage
Over the past year, Microsoft Azure has published several video post-incident retrospectives, providing visibility into its response and implementation of “lessons learned” from major incidents. One of the previous retrospectives provided insight into multiple cable breaks that occurred off the west coast of Africa earlier in the year. The Azure Virtual Desktop outage on September 16, covered in a recent Internet Report blog, is one of the latest to receive an incident retrospective video.
As a refresher, a subset of Azure Virtual Desktop users in various U.S. regions “experienced failures to access their list of available resources, make new connections, or perform management actions.” The problem was attributed to degradation affecting a SQL database that stores configuration data and an associated process that replicates that configuration data from the primary database to “multiple secondary copies,” which fell “several hours behind” the primary.
In reviewing the incident retrospective video, we were particularly interested in Microsoft’s discussion of how it chose to communicate the outage on the day it happened. Regular readers of The Internet Report will know we often analyze outages where status page communications are delayed or nonexistent. The consistent takeaway is to avoid relying solely on status pages for information and, ideally, to have independent visibility in place.
Microsoft communicated the problems with Azure Virtual Desktop about one hour and 15 minutes after the customer impact began. That communication was made directly to impacted customers via the Azure management portal and later posted to the public-facing status page. Interestingly, there was some discussion during the incident retrospective as to whether this incident was status page-worthy, since anyone impacted was already receiving updates directly. Sami Kubba from Azure Communications noted: “We sent a resolved communication and we put it on our status page. Looking back, I’m not sure if that was the right decision to make… I don’t know if this is the kind of thing that would belong there [on the status page] because every customer who was impacted was receiving communication.”
Microsoft also has an internal capability called the “Brain,” which appears to trigger automated communications to customers within 15 minutes of an incident, presumably based on detecting the recurrence of a known outage pattern. Kubba noted that the specific “service and scenario” in this Azure outage “is not onboarded onto Brain, but we’ll go and evaluate what’s the likelihood of this failing again, what is the blast radius, and how painful that is” before deciding whether or not to add it to the Brain.
This insight into the decision-making around when and where to share status updates provides helpful context and further underscores the need for independent visibility. Updates from companies may not be immediate, and having independent visibility can help your team shorten mean time to detection and resolution, augmenting any official sources, whether public or behind an authenticated portal.
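As a simple illustration of that idea, the sketch below runs an independent health check alongside a poll of a provider’s status feed and flags any disagreement. The endpoints and JSON shape here are hypothetical; real status APIs and payloads vary by provider.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoints and payload shape, for illustration only;
# real status APIs differ by provider.
STATUS_API = "https://status.example.com/api/status.json"
SERVICE_URL = "https://service.example.com/health"

def provider_reports_operational() -> bool:
    """What the provider's public status feed says (assumed JSON shape)."""
    with urllib.request.urlopen(STATUS_API, timeout=10) as resp:
        return json.load(resp).get("status") == "operational"

def independent_check_passes() -> bool:
    """Our own synthetic check against the service itself."""
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=10) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    if not independent_check_passes() and provider_reports_operational():
        print("Discrepancy: our checks are failing, but the status page shows no incident.")
```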
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (October 21 - November 3):
- The downward trend observed in the previous period reversed in the most recent period, with the total number of global outages increasing. In the first week of this period, ThousandEyes recorded a 17% rise in outages, with the number increasing from 155 to 181. This trend continued into the following week, where, between October 28 and November 3, outages rose from 181 to 187, a 3% increase compared to the previous week.
- During this period, the United States experienced a similar trend, with outages increasing by 10% in the first week (October 21 - 27). This was followed by an even larger increase the following week, with outages rising from 69 to 86, a 25% increase compared to the previous week.
- From October 21 to November 3, an average of 42% of all network outages occurred in the United States, maintaining the same level as the previous period from September 30 to October 20. This aligns with a pattern often seen this year, in which U.S.-centric outages typically account for at least 40% of all recorded outages.
- In October, 792 outages were observed globally, a 4% increase from the 763 outages recorded in September. In the United States, outages rose by 8%, increasing from 308 in September to 333 in October. This trend is consistent with 2023, when total outages both globally and in the U.S. also rose from September to October.