
The Internet Report

The SharePoint Outage and Why Manual Changes Need Extra TLC

By Mike Hicks
13 min read

Summary

In an end-to-end service delivery chain, isolated changes can have broad consequences and manual updates require extra care. Learn how this played out recently at SharePoint and explore other recent outages that impacted Slack, Starbucks, and NASA.


This is the Internet Report: Pulse Update, where we analyze outages and trends across the Internet, from the previous two weeks, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for first-hand commentary.

 


Internet Outages & Trends

We often encounter (and analyze) incidents where work in one part of an app or service has an unanticipated flow-on impact. These incidents underscore the importance of understanding the entire service delivery chain in order to be aware of every dependency and interconnection, helping you keep impact and footprint to a minimum.

Having a detailed grasp of all dependencies is particularly important when making manual changes. Organizations today typically rely on highly automated change processes: teams are technology-assisted from the start, and deployment and rollback may occur with little, if any, human intervention. However, there will always be exceptions where changes need to be made manually, outside the standardized, automated change process. These manual changes require special care and can be especially challenging because they require engineering teams to understand the intricacies of the change process without the usual automated checks and balances to assist them. Mistakes may be more likely to happen, potentially causing an outage or service disruption.

We saw this play out this past fortnight, as a manual TLS/SSL certificate change by Microsoft introduced the type of error that an automated system would probably have detected and prevented.

Read on to learn more about this outage and other recent incidents, or use the links below to jump to the sections that most interest you.


SharePoint Online and OneDrive Outage

On July 24, Microsoft experienced an issue that impacted connectivity to SharePoint Online and OneDrive for Business services. First observed around 19:05 UTC, it appeared to impact connectivity for users globally. 

Figure 1. SharePoint Online and OneDrive Business connectivity impacted globally. See more in the ThousandEyes platform.

Users encountered a certificate error when attempting to access SharePoint Online and OneDrive due to an erroneous change in the SSL certificate that prevented the establishment of a secure connection to the services. 

Figure 2. Certificate issued on July 24 shows incorrect domain name.

Approximately ten minutes later, at around 19:15 UTC, the erroneous certificate appeared to be replaced with a valid one, and SharePoint and OneDrive service reachability was restored for most users by around 19:20 UTC. Around 21:34 UTC, Microsoft announced that the outage was the result of a configuration issue and had been resolved.

A curious aspect of this outage is that it seemed to be triggered by a manual change. Generally speaking, a majority of outages today appear to be triggered by unexpected conditions encountered during the operation of automated change or deployment processes. Detection and rollback are also often automated.

There were likely specific and valid reasons that Microsoft engineers manually replaced the certificate. However, with manual updates, having checks and validations in place to guard against human error is especially important. Microsoft likely had such processes in place, and it’s not clear how the erroneous change made it through, but events like these are good reminders for organizations to make sure they have strategies in place to help catch issues like a domain name mismatch.
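
To make this concrete, here is a minimal sketch of the kind of pre-deployment guardrail that can catch a domain name mismatch before a certificate goes live. This is not Microsoft's actual process; the certificate path and hostname are hypothetical, and it assumes the Python `cryptography` package is available.

```python
# Minimal sketch of a pre-deployment check: does the new certificate's
# Subject Alternative Name (SAN) list actually cover the domain we're serving?
# Hypothetical file path and hostname; assumes the `cryptography` package.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID


def cert_covers_hostname(pem_path: str, hostname: str) -> bool:
    """Return True if the certificate's SAN entries cover the given hostname."""
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    san = cert.extensions.get_extension_for_oid(
        ExtensionOID.SUBJECT_ALTERNATIVE_NAME
    ).value
    for name in san.get_values_for_type(x509.DNSName):
        if name == hostname:
            return True
        # Allow single-label wildcards such as *.example.com
        if (
            name.startswith("*.")
            and hostname.endswith(name[1:])
            and hostname.count(".") == name.count(".")
        ):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical values, purely for illustration.
    if not cert_covers_hostname("new-cert.pem", "tenant.example.com"):
        raise SystemExit("Refusing to deploy: certificate does not match the target domain")
```

Wired into whatever pipeline or runbook governs manual certificate swaps, a check like this turns a silent mismatch into a hard stop before users ever see an error.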

The incident (and others like it) also highlights how important it is to have a certificate that’s valid in all respects (valid credentials, a correct domain name, etc.), as well as appropriate change verification steps. Every component in an end-to-end service delivery chain, and the teams and individuals responsible for those components, need to work in sync to maintain the service’s availability. Any deviation from standard operating procedures and documented change processes can introduce risk. It only takes a degradation or outage in one component to have a flow-on impact, potentially taking out the entire service. Given this reality, some teams may choose to invest in tools that provide visibility and early warning into things like soon-to-expire certificates—or put other strategies in place to guard against such issues.
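
On the early-warning side, the same idea applies to expiry. The rough sketch below connects to a live endpoint, validates the chain, and reports how many days remain on the leaf certificate; the hostname and 30-day threshold are placeholders, and dedicated monitoring tools offer this capability (and much more) out of the box.

```python
# Minimal sketch of a certificate expiry early-warning check for a live endpoint.
# Hypothetical host and threshold; standard-library only.
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to host:port, validate the chain, and return days until the leaf cert expires."""
    ctx = ssl.create_default_context()  # also enforces hostname matching during the handshake
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()  # parsed details of the validated certificate
    not_after = datetime.strptime(
        cert["notAfter"], "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    return (not_after - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    remaining = days_until_expiry("www.example.com")  # hypothetical endpoint
    if remaining < 30:  # hypothetical warning threshold
        print(f"WARNING: certificate expires in {remaining} days")
```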

Slack Outage

Slack experienced a “systemwide” issue on July 27 that left some users unable to send or receive messages for just under one hour. The issues occurred very early in the North American morning—just after 2 AM PDT—which would suggest system work was intended to happen outside of U.S. working hours (though it would have been mid-morning for parts of Europe).

A brief post-incident report notes that an issue was identified “after a change was made to a service that manages our internal system communication. This resulted in degradation of Slack functionality until the change was reverted which resolved the issue for all users.”

During the incident, the problem may have been difficult for users to recognize or diagnose from a customer-facing perspective because everything would have “looked right.” Users could still connect, their availability status was correct, and they could read messages in Slack and even think they’d responded.

However, everything wasn’t normal: it appeared that replies weren’t actually being sent. While things looked fine on the frontend, the backend components of the system could not be reached. This is a pattern observed consistently in app outages.
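
One way teams guard against this failure mode is with synthetic transactions that exercise the full write-and-read path instead of trusting that the frontend “looks right.” The sketch below is purely illustrative and is not Slack’s API: the base URL, endpoints, and token are hypothetical stand-ins for whatever messaging backend you operate or monitor.

```python
# Minimal sketch of a synthetic end-to-end check: post a canary message, then
# confirm it is actually visible on the read path. The API, endpoints, and token
# are hypothetical stand-ins, not Slack's real API. Assumes the `requests` package.
import time
import uuid

import requests

BASE_URL = "https://chat.example.com/api"  # hypothetical messaging API
HEADERS = {"Authorization": "Bearer synthetic-monitor-token"}  # hypothetical credential


def end_to_end_message_check(channel: str) -> bool:
    """Return True only if a posted canary message shows up when read back."""
    marker = f"canary-{uuid.uuid4()}"
    post = requests.post(
        f"{BASE_URL}/messages",
        headers=HEADERS,
        json={"channel": channel, "text": marker},
        timeout=10,
    )
    post.raise_for_status()  # the frontend accepted the message...

    time.sleep(2)  # give the backend a moment to persist and fan out

    history = requests.get(
        f"{BASE_URL}/messages",
        headers=HEADERS,
        params={"channel": channel, "limit": 20},
        timeout=10,
    )
    history.raise_for_status()
    # ...but did it actually make it through the backend to the read path?
    return any(m.get("text") == marker for m in history.json().get("messages", []))


if __name__ == "__main__":
    if not end_to_end_message_check("synthetic-monitoring"):
        raise SystemExit("ALERT: message accepted on the frontend but never visible on the read path")
```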

Like the SharePoint and OneDrive outage discussed earlier, this Slack outage appears to be another case of a change being made to one component of an app in isolation, without visibility into how that change might impact end-to-end service delivery.

Starbucks App Outage

On July 20, Starbucks sent a push notification through its app notifying customers their “order is ready”—whether they’d ordered a coffee or not. 

The glitch caused some confusion among customers, though we suspect it might also have triggered a surge in coffee demand.

The mass push notification coincided with a partial app outage that affected one specific portion of the Starbucks app’s functionality: the order ahead and pay feature. However, it’s unclear if the outage and the mistaken push notification were related. Certainly, the push notification issue suggests there was active work on the messaging portion of the app. You’d also expect some sort of acknowledgement notification to be a function of the order ahead and pay feature, which was impacted by the outage. As Starbucks attempted to fix the outage, it’s possible that a change was introduced or a test message was mistakenly pushed to production, which may have been the erroneous “order is ready” notification.
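
If a stray test message was indeed the culprit, it illustrates a class of mistake that a simple environment guard can help prevent. The sketch below is hypothetical and reflects nothing about Starbucks’ actual systems; the environment variable, topic names, and delivery hand-off are illustrative stand-ins.

```python
# Minimal, hypothetical guard against pushing test notifications to production.
# The environment variable, topic names, and delivery hand-off are illustrative only.
import os

PROD_TOPICS = {"order-ready", "marketing-broadcast"}  # hypothetical production topics


def send_push(topic: str, message: str) -> None:
    """Publish a push notification, refusing production topics from non-production environments."""
    env = os.environ.get("APP_ENV", "development")
    if topic in PROD_TOPICS and env != "production":
        raise RuntimeError(
            f"Refusing to publish to production topic '{topic}' from the '{env}' environment"
        )
    # Hand off to the real push provider here.
    print(f"[{env}] push to {topic}: {message}")


if __name__ == "__main__":
    # Run from a development environment, this raises instead of notifying real customers.
    send_push("order-ready", "Your order is ready!")
```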

Again, as demonstrated by the SharePoint and Slack disruptions, every piece of the service delivery chain has a critical role to play. 

NASA Communications Outage

Finally, NASA experienced a communications outage with the International Space Station (ISS) on July 25 that impacted command, telemetry, and voice communications. The incident represented the first time backup communications had to be switched on. 

The root cause of the issue was reportedly a power outage stemming from upgrade works in the building housing Mission Control at NASA’s Johnson Space Center in Houston. The outage appeared to impact communications only, which essentially means this was a ground-side issue, most likely limited to local connections into Mission Control.

While we don’t normally discuss space comms in this blog, this outage reinforces a theme that we often emphasize: the importance of robust failover processes. Whether serving users in space or a bit closer to home, it’s vital to understand the what and why of any issues that may occur in order to quickly ascertain and action the most appropriate plan or process.

NASA maintains a backup control center; however, during the July 25 outage, the flight controllers stayed at Mission Control in Houston as it appeared that the lights and air-conditioning were still operating. This potentially points to quick identification of the cause and confidence in resolution time. NASA noted that they would try to better understand what happened and glean any resulting lessons. 

As we’ve discussed in previous blogs, while it’s difficult to plan for every possible scenario, understanding what did and didn’t work when an outage occurs can help improve systems and processes, reducing the likelihood of the issue recurring. For example, in this case, in addition to their backup control center, NASA may also consider providing separate and/or backup power sources for their communication systems.


By the Numbers

In addition to the outages highlighted above, let’s close by taking a look at some of the global trends we observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (July 17-30):

  • Global outages trended downwards over this two-week period, initially dropping from 192 to 186, a slight 3% decrease when compared to July 10-16. This was followed by another drop from 186 to 156, a 16% decrease compared to the previous week (see the chart below).

  • This pattern was reflected in the U.S., where outages initially dropped from 96 to 74, a 23% decrease when compared to July 10-16. U.S. outage numbers then dropped from 74 to 60 the next week, a 19% decrease.

  • U.S.-centric outages accounted for 42% of all observed outages from July 17-30, somewhat smaller than the 51% observed between July 3-16. While this was a drop, it continues the trend observed since April, in which U.S.-centric outages have accounted for at least 40% of all observed outages.

Figure 3. Global and U.S. network outage trends over the past eight weeks.

