Alert fatigue is a common problem in monitoring systems: it sets in when a system triggers an excess of alerts, most of which are false positives, i.e. noise. Finding a good balance between false negatives and false positives is more of an art than a science. In this article I will cover some techniques to mitigate alert fatigue.
1. Automatic baselining instead of fixed-threshold alerts

Automatic baselining is the capability of automatically establishing a baseline for a given metric and alerting on statistical deviations from that baseline. Automatic baselines are most useful when the data shows periodic variations (e.g. traffic patterns), or when the appropriate alert threshold varies per location. There are still cases where you want to keep manual static thresholds, for example for packet loss, route reachability, and availability.
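To make the idea concrete, here is a minimal baselining sketch in Python. It is not ThousandEyes' implementation, just an illustration under simple assumptions: the baseline is a rolling window of recent samples, and a point is flagged when it deviates from the rolling mean by more than K standard deviations (WINDOW and K are arbitrary illustrative values).

```python
from collections import deque
from statistics import mean, stdev
import random

WINDOW = 60   # number of recent samples forming the baseline (assumed value)
K = 3.0       # how many standard deviations count as a deviation (assumed value)

history = deque(maxlen=WINDOW)

def is_anomalous(value: float) -> bool:
    """Return True when `value` deviates from the learned baseline."""
    if len(history) < WINDOW:
        history.append(value)
        return False                      # still learning the baseline
    baseline, spread = mean(history), stdev(history)
    history.append(value)
    if spread == 0:
        return value != baseline          # flat baseline: any change is a deviation
    return abs(value - baseline) > K * spread

# Example: steady latency around 20 ms, then a spike to 95 ms
random.seed(1)
samples = [20 + random.random() for _ in range(80)] + [95.0]
print([is_anomalous(v) for v in samples][-1])   # True: the spike is flagged
```

Because the baseline is learned from the data itself, the same rule adapts to locations with very different normal values, which is exactly where fixed thresholds break down.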
2. Secondary filters

For a given asset, e.g. a URL, alerts can fire from different locations at the same time. With N locations in total, a breach at a single location may not be worth notifying the user about. For example, users might only want to be alerted if the problem spans more than one country, or more than one network. In these cases it's useful to have a secondary filter on the number of affected locations, e.g. “Only alert me if more than one location is affected by this alert rule”. Another type of secondary filter is based on time, i.e. “Only alert me if this rule fires more than once in a row”.
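Below is a hedged sketch of how those two secondary filters could be combined; the rule identifiers, field names, and thresholds are illustrative, not an actual alert rule schema.

```python
from collections import defaultdict

MIN_LOCATIONS = 2      # "only alert me if more than one location is affected"
MIN_CONSECUTIVE = 2    # "only alert me if this rule fires more than once in a row"

consecutive_hits = defaultdict(int)   # rule id -> consecutive rounds in alert state

def should_notify(rule_id: str, triggered_locations: set[str]) -> bool:
    """Apply location- and time-based secondary filters to a raw alert."""
    if len(triggered_locations) >= MIN_LOCATIONS:
        consecutive_hits[rule_id] += 1
    else:
        consecutive_hits[rule_id] = 0
    return consecutive_hits[rule_id] >= MIN_CONSECUTIVE

# The rule must breach in at least 2 locations for 2 rounds in a row
print(should_notify("http-latency", {"phoenix"}))             # False
print(should_notify("http-latency", {"phoenix", "london"}))   # False (first round)
print(should_notify("http-latency", {"phoenix", "london"}))   # True
```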
3. Reference metrics to filter local problems

A common nemesis of external monitoring systems is local problems at the agent locations. For instance, if an agent can't reach its default gateway, packet loss jumps to 100% for all targets, and suddenly you have a flood of alert emails in your mailbox. At ThousandEyes we found a simple yet effective solution to this problem: whenever we add a new public agent to our infrastructure, we schedule a set of tests from N other agents toward that agent as the target. This tells us when the agent running the tests is itself experiencing network problems. In the timeline figure below, the gray bands under the timeline correspond to local network connectivity problems at an agent located in Phoenix, AZ. We can use this information to suppress alerts from a specific location whenever it is experiencing local problems.
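The sketch below illustrates the suppression step under assumed data structures (it is not the actual ThousandEyes pipeline): reference tests targeting each agent report its packet loss, and alerts originating from an agent with heavy reference loss are dropped.

```python
def filter_local_problems(alerts, reference_loss_by_agent, loss_threshold=50.0):
    """Drop alerts from agents whose reference tests show heavy packet loss.

    alerts: iterable of dicts with at least an 'agent' key.
    reference_loss_by_agent: agent name -> packet loss (%) seen by the
        N reference tests that target this agent.
    """
    kept = []
    for alert in alerts:
        agent = alert["agent"]
        if reference_loss_by_agent.get(agent, 0.0) >= loss_threshold:
            continue   # the agent itself is unreachable: local problem, suppress
        kept.append(alert)
    return kept

alerts = [{"agent": "phoenix", "target": "example.com", "loss": 100.0},
          {"agent": "london",  "target": "example.com", "loss": 2.0}]
print(filter_local_problems(alerts, {"phoenix": 100.0, "london": 0.0}))
# Only the London alert survives; Phoenix is suffering a local outage.
```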
4. Reasonable default alert rules

It's important to ship a default set of alert rules with each test, since that's what most users will work with. Default alert rules are therefore a good opportunity to fine-tune the balance between false negatives and false positives. A good starting point is to select metrics associated with availability rather than speed, e.g. packet loss and DNS resolution, and thresholds that err on the side of minimizing false positives.
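As an illustration only, a default rule set along those lines might look like the following; the metric names and thresholds are hypothetical, not ThousandEyes' shipped defaults.

```python
DEFAULT_ALERT_RULES = [
    # Availability-oriented metrics with conservative thresholds and a
    # secondary location filter, favoring fewer false positives.
    {"metric": "packet_loss",        "condition": ">=", "threshold": 20.0,    # percent
     "min_locations": 2},
    {"metric": "dns_resolution",     "condition": "==", "threshold": "error",
     "min_locations": 2},
    {"metric": "http_availability",  "condition": "<",  "threshold": 100.0,   # percent
     "min_locations": 2},
]
```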
5. Aggregate multiple alerts into a single message

This is a trick to minimize the number of emails sent for alerts that overlap in time. Let's say one alert is triggered for each of N different targets at a given time. Wouldn't it be better to receive a single email with all of these alerts rather than N separate messages? If alerts happen at the same time, there's also a chance they are correlated, so seeing them in a single message makes sense.
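A minimal sketch of time-window aggregation, assuming a simple alert format with a timestamp and a target:

```python
from itertools import groupby

def aggregate(alerts, window_seconds=60):
    """Yield one message per time bucket, listing every alert in that bucket."""
    bucket_of = lambda a: a["timestamp"] // window_seconds
    for bucket, group in groupby(sorted(alerts, key=bucket_of), key=bucket_of):
        group = list(group)
        targets = ", ".join(a["target"] for a in group)
        yield f"{len(group)} alert(s) in window {bucket}: {targets}"

alerts = [{"timestamp": 1000, "target": "api.example.com"},
          {"timestamp": 1010, "target": "www.example.com"},
          {"timestamp": 1900, "target": "cdn.example.com"}]
for message in aggregate(alerts):
    print(message)
# 2 alert(s) in window 16: api.example.com, www.example.com
# 1 alert(s) in window 31: cdn.example.com
```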
Along with the points described above, it's also important for a monitoring system to provide an API from which alerts can be fetched. The API can be used to download raw alerts and apply logic on the user side to filter what is relevant and what is not. In a sense, this pushes the complexity of alert processing to the user side for more advanced use cases.
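For example, a client could fetch raw alerts and filter them locally, along these lines. The endpoint URL, parameters, and response shape here are assumptions for illustration, not a documented API.

```python
import requests

API_URL = "https://api.example-monitoring.com/v1/alerts"   # hypothetical endpoint

def fetch_relevant_alerts(token: str, min_locations: int = 2):
    """Download raw alerts and apply user-side relevance filtering."""
    response = requests.get(API_URL,
                            headers={"Authorization": f"Bearer {token}"},
                            params={"window": "1h"},
                            timeout=10)
    response.raise_for_status()
    alerts = response.json().get("alerts", [])
    # Keep only alerts spanning several locations; for this particular user,
    # everything else is noise.
    return [a for a in alerts if len(a.get("locations", [])) >= min_locations]
```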
At ThousandEyes we put significant effort into reducing false positive alerts and alert fatigue in general, while making sure our default alert rules stay sensitive enough to prevent obvious false negatives.