From Noise to Signal: Machine Learning Anomaly Detection for Network Telemetry

Cisco ThousandEyes collects performance data from a wide range of vantage points distributed across the public internet and within customer networks. These vantage points continuously monitor availability, stability, and end-to-end performance to help teams troubleshoot network and application issues.

Tests are typically executed at regular intervals—often every few minutes—producing a steady stream of performance measurements. Over time, these measurements form time series data that reflect the normal operating behavior of the network.

When a metric suddenly deviates from its expected pattern, it often signals an underlying issue, such as congestion, routing instability, or service disruption. Detecting these deviations early is important for identifying and resolving network problems before they impact users [1, 2, 5].

Many anomaly detection systems in production environments rely on rule-based thresholding methods, most commonly statistical deviation techniques such as moving z-scores or moving interquartile range (IQR) [2, 3, 4]. These approaches are appealing because they are simple, interpretable, and relatively easy to operationalize at scale.

However, their simplicity comes at a cost. Rule-based thresholding does not explicitly model the underlying structure of time series data—such as seasonality, daily traffic cycles, workload-driven patterns, or long-term trends. As a result, predictable and recurring fluctuations are often misclassified as anomalies. In dynamic systems like network environments, this leads to excessive false positives and alert fatigue.

Another fundamental challenge lies in selecting the right threshold. If the threshold is set too high, meaningful anomalies may be missed. If it is set too low, the system produces an excessive number of alerts, overwhelming teams with false positives. Striking the right balance is difficult, especially in dynamic environments where normal behavior continuously evolves.
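To make the thresholding approach described above concrete, here is a minimal sketch of a one-sided moving z-score detector. The function name, the 12-sample window, the threshold of 2, and the sample latency values are illustrative assumptions, not the production ThousandEyes implementation.

```python
from statistics import mean, pstdev


def moving_zscore_flags(values, window=12, threshold=2.0):
    """Flag points whose z-score against the preceding `window` points
    exceeds `threshold`. Only upward deviations are flagged."""
    flags = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history yet
            continue
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            flags.append(value > mu)  # flat history: any increase stands out
            continue
        flags.append((value - mu) / sigma > threshold)
    return flags


# Hypothetical latency series (ms): steady near 70 ms with a single spike.
latency = [70, 71, 69, 70, 72, 70, 69, 71, 70, 70, 71, 69, 545, 70, 71]
flags = moving_zscore_flags(latency, window=12, threshold=2.0)
print([i for i, flagged in enumerate(flags) if flagged])
```

Lowering `threshold` flags more points and raising it flags fewer, which is the trade-off described above. Once the spike enters the window it also inflates the window's mean and standard deviation, so elevated values that follow it can go unflagged.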

To move beyond generic thresholds, we need to define what truly matters. That means providing representative examples of meaningful incidents and leveraging them to train machine learning (ML) models that can learn complex behavioral patterns. By incorporating context and historical patterns into the detection process, ML-based approaches can better distinguish between routine variability and genuine performance degradation—delivering more precise, reliable anomaly detection.

Anomaly Patterns on Ping Latency

For this study, we concentrated on ping latency, defined as the average round-trip time of packets between a test endpoint and an agent. Latency is a key indicator of network health: when it increases unexpectedly, it often signals congestion, routing issues, or service degradation.

In our framework, sudden increases in latency for a given <test, agent> pair are treated as potential network anomalies. These upward deviations are surfaced to downstream systems—such as event detection pipelines and alerting mechanisms—so they can trigger investigation or automated response.

Importantly, latency is a one-sided (upper-bound) metric for anomaly detection: we are primarily concerned with abnormal increases. When latency rises beyond expected patterns, it is flagged as anomalous and requires action. Conversely, decreases in latency are considered favorable outcomes, as lower latency reflects improved network performance.

We generated 14 days of time series data from ping latency and analyzed different patterns of the signals.

There are various patterns in time series data:

Spikes on steady signals: In this scenario, latency remains consistently stable, hovering around 70 ms with minimal variance. The signal exhibits a tight, well-defined baseline, making deviations easy to identify. When latency abruptly spikes to 545 ms, the change is both sudden and substantial relative to the historical pattern. Given the stability of the signal, this sharp increase clearly stands out as an anomaly and warrants immediate attention.

Figure 1: Spike anomalies on a steady signal

Long Term Anomaly: In this type of signal, the metric remains stable at around 70 ms before experiencing a sustained increase in latency. The elevated values persist for more than 10 hours before eventually returning to the normal baseline. For long-duration anomalies like this, it is important not only to detect the initial spike but also to capture the full span of the degradation.

Figure 2: Long-term anomaly lasting several hours

Multi-modal Signal: In this type, the signal oscillates between distinct value bands, repeatedly moving back and forth within these ranges. Some observations cluster around lower values (e.g., ~20 ms), while others rise to higher levels (e.g., ~110 ms).

Although the variation may appear significant at first glance, all values fall within the expected operating baseline. The shifts between bands reflect normal system behavior rather than anomalous events, and therefore no anomaly is present in this signal.

Figure 3: No anomaly: values move back and forth between distinct bands

Periodic Signal: In this type, we observe recurring increases at roughly the same time each day, each lasting approximately two to four hours. Because these spikes occur at a consistent, periodic cadence, they represent expected cyclical behavior rather than true anomalies. Such patterns are typically associated with predictable workload shifts or daily usage cycles and should be modeled as part of the system’s normal baseline.

Figure 4: No anomaly: values rise around the same time each day for a couple of hours

High Variance Signal: This category includes signals that exhibit large fluctuations without a consistent or recognizable pattern. The values frequently move up and down, often spanning a wide range, typically between 250 ms and 500 ms. Despite the magnitude of these swings, such behavior does not necessarily indicate an anomaly.

Figure 5: No Anomaly: Highly volatile signal with high variance
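For the periodic pattern described above, one simple way to treat recurring daily increases as part of the baseline is to compare each observation with values seen at the same time of day on previous days. The sketch below is only an illustration of that idea: the function name, the `days_back` parameter, and the sample series are assumptions and are not taken from the study.

```python
from statistics import median


def seasonal_baseline(values, samples_per_day, index, days_back=7):
    """Median of the values observed at the same time of day on previous days.

    Returns None when no earlier day is available.
    """
    same_slot = [
        values[index - d * samples_per_day]
        for d in range(1, days_back + 1)
        if index - d * samples_per_day >= 0
    ]
    return median(same_slot) if same_slot else None


# Hypothetical series: 4 samples per day, with the last slot of each day elevated.
day = [20, 20, 20, 110]
series = day * 5  # five identical days
baseline = seasonal_baseline(series, samples_per_day=4, index=19)
print(baseline)
```

With this kind of baseline, a daily rise to about 110 ms is compared against earlier rises at the same hour rather than against the quiet periods around it.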

Dataset Preparation

After analyzing variance patterns in the time series data, we built a labeled dataset designed to capture a diverse range of both anomalous and normal behaviors. We generated 500 time series samples to provide representation across different signal patterns and variability types. Using Label Studio, domain experts manually annotated anomalies within each series, grounding the dataset in real operational context.

With the human-labeled dataset in place, we trained machine learning models and evaluated them on unseen data from different time periods. Each evaluation cycle revealed misclassified samples and previously underrepresented patterns that were not fully captured in the original dataset.

We then incorporated these misclassified examples back into the labeling workflow. Domain experts reviewed and annotated them, and the expanded dataset was used to retrain the model. This iterative, human-in-the-loop process continued—train, evaluate, refine—until the dataset and model performance reached a satisfactory level.

Figure 6: Continuous improvement of the labeled dataset with ML

Machine Learning Models

To detect the anomalies, we have to look at historical data to understand what “normal” looks like. The size of the lookback window plays a critical role in shaping that understanding, as different windows surface different types of anomalies.

Shorter lookback windows are more sensitive to small, sudden deviations. In contrast, longer lookback windows help the model learn broader patterns in the signal—such as daily cycles, recurring peaks, or seasonality—providing a more stable baseline of expected behavior.

Rather than relying on a single long window, combining both short- and long-term perspectives produces a more robust detection strategy. Together, they strengthen anomaly detection and increase confidence in each flagged anomaly.

Figure 7: Different lookback windows for feature extraction

From each lookback window, we engineered a set of statistical features designed to capture distributional shifts and structural changes in the signal. These features included rolling z-scores, normalized interquartile differences, slopes between the observed value and various percentile thresholds, lags, the quartile coefficient of dispersion, as well as measures of asymmetry and tail behavior. Together, these features provided a compact but expressive summary of how the current observation deviates from its recent context.
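The feature extraction described above can be sketched as follows. This covers only a subset of the listed features, and the exact definitions, the feature names, and the window sizes (12 and 288 samples) are assumptions made for illustration; the article does not specify them.

```python
from statistics import mean, pstdev, quantiles


def window_features(history, value):
    """Summarize how `value` deviates from the points in `history`."""
    q1, _, q3 = quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    mu, sigma = mean(history), pstdev(history)
    return {
        # Rolling z-score of the current observation.
        "zscore": (value - mu) / sigma if sigma > 0 else 0.0,
        # Distance above the upper quartile, normalized by the IQR.
        "iqr_distance": (value - q3) / iqr if iqr > 0 else 0.0,
        # Quartile coefficient of dispersion of the window itself.
        "qcd": iqr / (q3 + q1) if (q3 + q1) > 0 else 0.0,
        # Simple lag feature: change from the previous sample.
        "lag_1_diff": value - history[-1],
    }


def multi_window_features(values, index, windows=(12, 288)):
    """Combine features from every lookback window that fits before `index`."""
    features = {}
    for w in windows:
        if index < w:
            continue  # not enough history for this window
        history = values[index - w:index]
        for name, feat in window_features(history, values[index]).items():
            features[f"{name}_w{w}"] = feat
    return features


# Hypothetical series: a steady signal near 70 ms followed by a spike to 545 ms.
values = [68, 70, 72, 70] * 6 + [545]
features = multi_window_features(values, index=24, windows=(4, 24))
print(sorted(features))
```

Each window contributes its own copy of every feature, so a downstream model can weigh short-term and long-term context separately.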

We evaluated several machine learning models, including Logistic Regression, XGBoost, Random Forest, LSTM networks, and CNN-based architectures. Among them, tree-based algorithms like Random Forest and XGBoost consistently outperformed the other techniques on the validation dataset, delivering the best balance of precision and recall. Feature importance analysis revealed that rolling z-score–based features and slope-based features were the strongest contributors to model performance.

As with most anomaly detection systems, there is a natural trade-off between recall and precision.

Simpler approaches—such as Logistic Regression or a z-score threshold set at one standard deviation—tend to achieve very high recall. They successfully capture nearly all labeled anomalies. However, this sensitivity comes at a cost: they flag many normal points as anomalous, leading to a high false positive rate and, consequently, low precision.

In contrast, ensemble-based models like Random Forest and XGBoost maintain strong recall while significantly reducing false positives. By better capturing nonlinear relationships and contextual patterns in the data, these models achieve higher precision and improved F1-scores. The result is a more balanced and operationally useful anomaly detection system—one that surfaces meaningful issues without overwhelming downstream alerting pipelines.
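As a reminder of how these metrics relate, the sketch below computes point-wise precision, recall, and F1 from binary labels and predictions. The function name and the two sets of toy predictions are hypothetical and are not drawn from the evaluation data.

```python
def precision_recall_f1(labels, predictions):
    """Point-wise precision, recall, and F1 for binary anomaly labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two true anomalies among ten points.
labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
sensitive = [1, 1, 0, 1, 1, 1, 0, 1, 0, 0]  # catches both, plus four false alarms
balanced = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]   # catches both, plus one false alarm

print(precision_recall_f1(labels, sensitive))
print(precision_recall_f1(labels, balanced))
```

Both detectors reach a recall of 1, but the one with fewer false alarms has higher precision and therefore a higher F1 score.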

Z-score (threshold: 2 standard deviations)

Type	Precision	Recall	F-Score
Spike	0.06	1.00	0.11
Multi-modal	0.43	0.78	0.55
Periodic	0.45	0.62	0.52
Long Duration	0.22	0.54	0.32
High Variance	0.22	0.96	0.36

Logistic Regression

Type	Precision	Recall	F-Score
Spike	0.14	1.00	0.25
Multi-modal	0.09	1.00	0.17
Periodic	0.12	1.00	0.21
Long Duration	0.07	1.00	0.14
High Variance	0.07	1.00	0.13

Random Forest

Type	Precision	Recall	F-Score
Spike	0.46	0.87	0.60
Multi-modal	0.31	0.94	0.47
Periodic	0.37	0.83	0.51
Long Duration	0.35	0.93	0.50
High Variance	0.36	0.86	0.51

XGBoost

Type	Precision	Recall	F-Score
Spike	0.50	1.00	0.66
Multi-modal	0.23	0.97	0.38
Periodic	0.27	0.88	0.41
Long Duration	0.31	1.00	0.48
High Variance	0.27	0.87	0.42

Time Series Foundational Models on Anomaly Detection

Time Series Foundational Models (TSFMs) are pretrained on large and diverse datasets spanning multiple domains. By learning from heterogeneous signals—ranging from infrastructure metrics to business and environmental data—they develop generalized representations of temporal behavior. This enables knowledge transfer across domains and notably reduces the need for manual feature engineering. In practice, a raw time series can be fed directly into the model, which then forecasts the next n time steps.

Chronos-v2 [6] supports zero-shot univariate and multivariate forecasting, allowing predictions without task-specific retraining. In our setup, we performed one-hour-ahead forecasting using 48 hours of historical context. The model generated probabilistic forecasts at different quantile levels, namely p25, p50, and p75.

To quantify forecast uncertainty, we computed the interquartile range (IQR) from the predicted quantiles:

Chronos_IQR = Forecasted 75th Quantile − Forecasted 25th Quantile

Anomaly Decision Rule: observed value > p75 + 2 × Chronos_IQR
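The decision rule above can be sketched as a small function. It takes already-computed quantile forecasts as plain lists and deliberately does not call the Chronos library; the function name, the parameter `k`, and the sample numbers are illustrative assumptions.

```python
def chronos_iqr_flags(observed, p25, p75, k=2.0):
    """Flag observations that exceed p75 + k * (p75 - p25).

    `p25` and `p75` are the forecasted 25th and 75th quantiles for each step.
    """
    flags = []
    for y, low, high in zip(observed, p25, p75):
        iqr = high - low  # Chronos_IQR for this time step
        flags.append(y > high + k * iqr)
    return flags


# Hypothetical forecasts and observations (ms) for three time steps.
observed = [72, 95, 300]
p25 = [68, 68, 69]
p75 = [74, 74, 75]
print(chronos_iqr_flags(observed, p25, p75))
```

Because the threshold scales with the forecasted IQR, wider quantile bands raise the bar that an observation must clear before it is flagged.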

While this approach captures short-term deviations effectively, we observed that Chronos can be sensitive to anomalies within the input history. Large anomaly segments tend to influence subsequent forecasts, elevating predicted values in the post-anomaly period. As a result, prolonged or sustained anomalies may be partially absorbed into the forecast baseline, causing the model to miss longer-term degradations. This behavior stems from the model’s objective of minimizing prediction error over time, which implicitly encourages it to absorb persistent deviations into its learned dynamics rather than maintaining them as outliers. As the forecasting horizon extends, the model also broadens its predicted quantile ranges, increasing uncertainty bands and further masking anomalies by making extreme values more likely to fall within expected bounds.

Sample Outputs

Figure 8: Model outputs on a low-variance spike anomaly. Z-score tags many points as anomalous, and the post-anomaly region is affected by the high values. Chronos catches the initial part of the anomalous regions but normalizes the rest; it also tags several points after the anomalous region subsides. Random Forest catches the desired anomaly regions without creating false alarms.

Figure 9: Model outputs on a periodic signal. Z-score tags the peaks of the periodic signal as anomalies. Chronos tags the rise of each cycle but not the peak, which is an improvement over the statistical approach. Random Forest does not flag the periodic pattern as an anomaly.

Conclusion

By moving beyond traditional statistical thresholding and adopting Machine Learning (ML) techniques, we enabled the system to learn what truly constitutes an anomaly. Rather than relying solely on fixed statistical deviations, the models were trained on examples that reflected meaningful anomalous behavior in our environment. As a result, we reduced the noise and false positives commonly produced by statistical threshold-based methods, allowing the system to focus on anomalies that matter most to users.

References

[1] ThousandEyes Event Detection Platform, https://docs.thousandeyes.com/product-documentation/event-detection
[2] ThousandEyes Dynamic Baselines, https://docs.thousandeyes.com/product-documentation/alerts/creating-and-editing-alert-rules/dynamic-baselines
[3] Chase Anomaly Detection, https://medium.com/next-at-chase/cutting-time-to-detect-customer-impact-with-z-score-anomaly-detection-8dd03cbd9227
[4] Booking’s Statistical Anomaly Detection System, https://medium.com/booking-com-development/anomaly-detection-in-time-series-using-statistical-analysis-cc587b21d008
[5] Detecting Network Anomalies on Latency, https://medium.com/thousandeyes-engineering/detecting-anomalies-in-network-latency-time-series-from-statistical-filters-to-machine-learning-b8098dc61109
[6] Amazon Chronos Foundational Model, https://github.com/amazon-science/chronos-forecasting
