Depending on your perspective, October 4th might have felt like the day the Internet collapsed. And that's the thing: how outages across the Internet affect you depends entirely on your perspective. In this blog series, we'll aim to give you insight into the mood of the Internet each week, highlighting not just notable outages but also the trends and nuances around them, and how they shape perceptions of the Internet's health.
The truth is that the Internet didn't collapse on October 4th; only Facebook and its WhatsApp and Instagram services became inaccessible, causing much consternation among their many users. For everyone else, the Internet was available, functioning, and accessible. In fact, it held up very well, even in the face of increased load as Facebook, WhatsApp, and Instagram users sought out alternative ways to communicate.
Our team of Internet experts has already provided a deep dive explaining exactly how the outage manifested itself, and Facebook published its own report of what happened after the event. We now know there were no nefarious actors behind the outage; rather, a configuration change was to blame.
What made the Facebook configuration change noteworthy from a networking perspective isn't so much its collateral damage as its timing, during U.S. business hours, and its mean time to recovery.
Globally, only 38% of all outages last week occurred during business hours. That volume isn't too surprising: the rise of software-defined architectures, coupled with a global 24x7 uptime requirement, means more planned maintenance now takes place within business hours than it used to.
In the U.S., planned maintenance typically occurs during the work week. Since early May, we've seen maintenance work globally peaking on Fridays, aligning with a high incidence of outages now also being concentrated on Fridays.
A lot of the outages we observe appear to be the result of planned maintenance or engineering work, and Facebook is no exception. What undid Facebook wasn't the maintenance work per se; after all, this was the type of task its engineers had performed many times before. It was the flow-on effects, driven by the architecture of its applications and networks, and by the architecture of the Internet itself.
Whatever Facebook did took out its authoritative nameservers, which, believing the network connection to be "unhealthy," stopped advertising the routes to Facebook's servers, rendering the services inaccessible across the Internet. The result was that Facebook looked unreachable; as Wired put it, Facebook fell "off the Internet's map." There was simply no path providers could use to send traffic to Facebook.
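To make that failure mode concrete, here is a minimal sketch (standard-library Python only, and purely illustrative; it is not how any particular monitoring product runs its tests) of a probe that distinguishes a DNS-level failure, which is what observers saw that day, from an ordinary connection failure:

```python
# Illustrative sketch only: a minimal probe that separates a DNS resolution
# failure from a TCP connection failure. Domain and port are examples.
import socket

def probe(domain: str, port: int = 443, timeout: float = 5.0) -> str:
    try:
        # Step 1: ask the local resolver for an address. On October 4th this
        # step failed for facebook.com, because the routes to its
        # authoritative nameservers had been withdrawn.
        addr_info = socket.getaddrinfo(domain, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS resolution failed for {domain}: {exc}"

    ip = addr_info[0][4][0]
    try:
        # Step 2: attempt a TCP connection to the resolved address.
        with socket.create_connection((ip, port), timeout=timeout):
            return f"{domain} resolved to {ip} and accepted a connection"
    except OSError as exc:
        return f"{domain} resolved to {ip} but the connection failed: {exc}"

if __name__ == "__main__":
    print(probe("facebook.com"))
```

On a normal day the probe reports a successful resolution and connection; during the outage, step 1 would have been the one to fail, because no route existed to the nameservers that answer for facebook.com.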
The unprecedented nature of the outage, for Facebook at least, is likely to prompt some substantial post-incident reports (PIRs) and planning. There are important lessons here for Facebook and its engineers, because this was an unusually severe incident to result from a planned maintenance activity.
There is nothing wrong with undertaking maintenance or engineering work during business hours. On the contrary, given the global 24x7 nature of business now powered by the Internet, it isn't always possible to schedule outside of everyone's business hours; in a follow-the-sun world, it's always someone's business hours. But it is important to understand the disruption the maintenance work could cause if something goes wrong, so that should something go awry, it has the least possible impact on the user population.
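As a rough illustration of that scheduling idea, the sketch below (with entirely made-up traffic numbers in a hypothetical `hourly_requests` dataset) picks the hour with the lowest combined traffic across regions as the least disruptive maintenance window:

```python
# Illustrative only: given hypothetical hourly request counts per region,
# find the UTC hour with the lowest combined traffic, i.e. the window where
# a failed change would affect the fewest active users.
from collections import defaultdict

hourly_requests = {
    # region: {utc_hour: request_count, ...} -- made-up figures
    "americas": {13: 90_000, 20: 120_000, 3: 8_000},
    "emea":     {13: 70_000, 20: 30_000,  3: 5_000},
    "apac":     {13: 20_000, 20: 15_000,  3: 60_000},
}

def quietest_hour(per_region: dict) -> int:
    totals = defaultdict(int)
    for counts in per_region.values():
        for hour, requests in counts.items():
            totals[hour] += requests
    # The hour with the smallest combined total is the least disruptive window.
    return min(totals, key=totals.get)

print(f"Least disruptive maintenance window: {quietest_hour(hourly_requests):02d}:00 UTC")
```

The point isn't the code itself but the discipline it represents: knowing where your users are, hour by hour, before you decide when to touch production.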
Case in point: within the same 24 hours as Facebook, a large collaboration application provider ran its own scheduled maintenance. While that operator is also global in scale, the impact of its work was isolated. Because the work was deemed potentially disruptive, it was scheduled for the early hours of a Sunday U.S. time, limiting the chance of users being on business calls. During the outage, established calls briefly dropped before being re-established out of the closest data center not under maintenance.
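A simplified sketch of that failover behavior might look like the following; the data center names and latencies are hypothetical, and this is an assumption-based illustration, not the provider's actual logic:

```python
# Illustrative only: when a call's data center goes into maintenance,
# reconnect to the closest data center that is still available.
from typing import Dict, Optional

# Hypothetical measured round-trip times from a client, in milliseconds.
DATACENTER_RTT_MS: Dict[str, float] = {
    "us-east": 25.0,
    "us-west": 70.0,
    "eu-west": 110.0,
}

def pick_datacenter(rtt_ms: Dict[str, float], in_maintenance: set) -> Optional[str]:
    """Return the lowest-latency data center that is not under maintenance."""
    candidates = {dc: rtt for dc, rtt in rtt_ms.items() if dc not in in_maintenance}
    return min(candidates, key=candidates.get) if candidates else None

# Before maintenance the client prefers us-east; once us-east is taken down,
# the call is re-established against the next-closest healthy site.
print(pick_datacenter(DATACENTER_RTT_MS, in_maintenance=set()))        # us-east
print(pick_datacenter(DATACENTER_RTT_MS, in_maintenance={"us-east"}))  # us-west
```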
This, of course, is nothing new; a risk assessment component has been part of change control processes for as long as networks have been around. In Facebook's defense, it cites a bug in one of its audit tools that would ordinarily have stopped the configuration push, thereby reducing the risk of a business-hours maintenance window. But if you're able to visualize a service from end to end, chances are you'll uncover hidden dependencies you weren't fully aware of, allowing you to prepare contingency plans.
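As a rough illustration of what that kind of audit check does (this is a hypothetical sketch, not Facebook's actual tool, and the dependency map below is invented for the example), the snippet walks declared service dependencies and blocks any change that would take authoritative DNS down with it:

```python
# Hypothetical pre-change audit check: before a change is pushed, walk the
# declared dependencies and refuse the push if a critical service
# (here, authoritative DNS) would be affected.
from typing import Dict, List

# Hypothetical dependency map: service -> services it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "web-frontend": ["authoritative-dns", "backbone"],
    "authoritative-dns": ["backbone"],
}

def affected_services(changed: str, deps: Dict[str, List[str]]) -> set:
    """Return every service that directly or transitively depends on `changed`."""
    impacted = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for service, requirements in deps.items():
            if current in requirements and service not in impacted:
                impacted.add(service)
                frontier.append(service)
    return impacted

def audit_change(changed: str) -> bool:
    impacted = affected_services(changed, DEPENDENCIES)
    if changed == "authoritative-dns" or "authoritative-dns" in impacted:
        print(f"Blocked: change to '{changed}' would impact authoritative DNS "
              f"(also affects: {sorted(impacted)})")
        return False
    print(f"Allowed: change to '{changed}' affects {sorted(impacted) or 'nothing else'}")
    return True

audit_change("backbone")  # blocked: DNS transitively depends on the backbone
```

The hard part in practice is keeping the dependency map complete and current, which is exactly where end-to-end visibility earns its keep.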
One takeaway from the Facebook outage is that there is no "one-size-fits-all" for outages, and they can happen to anyone. Rather, it is the combination of application architecture and underlying network infrastructure, together with not just time-of-day but time-of-day relative to your user population, that will ultimately determine the impact of a disruption. The greater your understanding of critical dependencies, the better your pre-planned mitigation strategies will be, as well as your ability to schedule, localize, and reduce the impact to your users. And that really is the best outcome that application and tech teams can hope for.