The three-week upward outage trend came to an end this week as total global outage numbers declined 19% compared to the previous week. However, the numbers for the U.S. rose slightly, with a 5% increase in observed outages for the week. This increase in U.S. outages, in conjunction with the decrease in global outage numbers, means that U.S. outages accounted for 43% of all observed outages this week. This is a 10 percentage point increase over the previous week, when they accounted for only 33%, and it marks a return (for the first time this year) to percentages over 40%, with the average for 2021 observed at around 47%.
Beyond the outage numbers, this week we saw a continuation or escalation of two sets of outages.
First, Discord experienced a two-hour outage on Wednesday, January 26 just before lunch Pacific Time. With more than 150 million active users a month, this outage was quickly felt—and not just by consumer users and gamers. Discord is also a key collaboration tool used by some virtual offices, providing an open communication channel for enterprise developers and other types of teams.
According to the official incident report, Discord’s problems initially manifested as a “widespread API outage.” Discord’s engineers quickly recognized the underlying issue but also found themselves “dealing with a secondary issue on one of [its] database clusters.” The full on-call response team was online and responding to the issue to get everything back up.
Discord’s engineers appeared to take several actions in response, including rate-limiting login traffic and “turning down” parts of its system, with the aim of reducing load on its databases, reducing errors and bringing things back online in a controlled manner.
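To make the idea of rate-limiting login traffic concrete, here is a minimal sketch of a token bucket, one common way to cap how fast requests reach a struggling backend. It is not Discord's implementation, which the incident report does not describe at this level. The names `TokenBucket` and `handle_login`, and the rate and capacity values, are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so the limiter can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_login(bucket, authenticate):
    """Hypothetical login handler: shed load before touching the database."""
    if not bucket.allow():
        return 429  # HTTP "Too Many Requests"; the client should retry later
    authenticate()  # only admitted requests reach the backend
    return 200
```

Rejecting excess logins at the edge means the database only sees traffic at a rate it can absorb, and raising `rate` in steps is one way to bring a service back online in a controlled manner.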
The other outage (or set of outages) to dig into this week involves Solana, a blockchain player that has marketed itself on being able to transact at speed and, theoretically, to process tens of thousands of transactions per second, more than some of its rivals. But after a sixth serious outage this month, it is facing some questions.
The Solana blockchain has been chosen by a number of emerging development houses as the protocol and platform upon which to base their metaverse plays. But the caution for any provider of technology underpinning the metaverse is the number of players in what is still a rapidly emerging space. That said, the extent to which a single provider’s string of outages forces developers to rethink where they can best hone and scale up their own metaverse ambitions remains to be seen.
One thing to point out is that both Discord and Solana were relatively open about the technical causes of the outages, as well as the mechanics of their responses. We’ve previously analyzed why it is important for a service provider to be transparent: well-informed users are less likely to be angry at a degradation or outage.
But both sets of outages also raise questions about application design. Too often, we see application designs that ignore or fail to factor in the constraints of underlying interdependencies, such as backend APIs or the networks over which application traffic flows. As many of us know and have experienced firsthand, the network is frequently blamed as the bottleneck when in reality the problem lies elsewhere. That is just one reason why application owners need to be able to test external APIs directly and at a granular level, from within the context of their core application (instead of only through a front-end interaction), as well as understand the impact of the underlying network transport.
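One way to tell a slow network from a slow API is to time the connection and the HTTP exchange separately. The sketch below does that with Python's standard library only. The function name `probe`, the result keys, and the example URL are assumptions for illustration, and a production monitor would also sample repeatedly and from several vantage points.

```python
import http.client
import time
from urllib.parse import urlsplit


def probe(url, timeout=5.0):
    """Issue one GET and time the connect and response phases separately."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    result = {"url": url, "connect_s": None, "response_s": None,
              "status": None, "error": None}
    try:
        start = time.perf_counter()
        conn.connect()  # network side: DNS lookup, TCP handshake, TLS if https
        result["connect_s"] = time.perf_counter() - start

        start = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # include the body transfer in the measurement
        result["response_s"] = time.perf_counter() - start
        result["status"] = response.status
    except (OSError, http.client.HTTPException) as exc:
        # Timeouts, refused connections, and malformed responses all land here.
        result["error"] = repr(exc)
    finally:
        conn.close()
    return result


if __name__ == "__main__":
    # Placeholder endpoint; substitute the API dependency you want to check.
    print(probe("https://example.com/"))
```

If `connect_s` is small while `response_s` is large, the delay is more likely in the API or its backend than in the network path; the reverse pattern points toward the transport.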
As applications become more complex, apps that fall out of sync with their underlying networks are increasingly likely to become a bottleneck, leading to the service degradations and outages that we’ll see in our weekly pulse.
Only time will tell how often this occurs.