Cisco ThousandEyes collects performance data from a wide range of vantage points distributed across the public internet and within customer networks. These vantage points continuously monitor availability, stability, and end-to-end performance to help teams troubleshoot network and application issues.
Tests are typically executed at regular intervals—often every few minutes—producing a steady stream of performance measurements. Over time, these measurements form time series data that reflect the normal operating behavior of the network.
When a metric suddenly deviates from its expected pattern, it often signals an underlying issue, such as congestion, routing instability, or service disruption. Detecting these deviations early is important for identifying and resolving network problems before they impact users [1, 2, 5].
Many anomaly detection systems in production environments rely on rule-based thresholding methods, most commonly statistical deviation techniques such as moving z-scores or moving interquartile range (IQR) [2, 3, 4]. These approaches are appealing because they are simple, interpretable, and relatively easy to operationalize at scale.
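As a rough illustration of the two statistical techniques named above, the sketch below flags points with a moving z-score and a moving IQR rule. The function names, the 12-point window, and the thresholds (3 standard deviations, 1.5 × IQR) are illustrative assumptions, not settings taken from any production system.

```python
from statistics import mean, pstdev, quantiles

def moving_zscore_flags(values, window=12, threshold=3.0):
    """Flag points whose z-score against the previous `window` points exceeds `threshold`."""
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)  # not enough history yet
            continue
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0:
            z = (x - mu) / sigma
        else:
            # A perfectly flat window: any change is an unbounded deviation
            z = 0.0 if x == mu else float("inf")
        flags.append(abs(z) > threshold)
    return flags

def moving_iqr_flags(values, window=12, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] of the previous `window` points."""
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)
            continue
        q1, _, q3 = quantiles(history, n=4)
        iqr = q3 - q1
        flags.append(x > q3 + k * iqr or x < q1 - k * iqr)
    return flags

# Hypothetical latency samples (ms): a steady ~70 ms signal with one spike
latency_ms = [70, 71, 69, 70, 72, 70, 69, 71, 70, 70, 71, 69, 545, 70, 71]
z_flags = moving_zscore_flags(latency_ms)
iqr_flags = moving_iqr_flags(latency_ms)
print(z_flags.index(True), iqr_flags.index(True))
```

Both rules depend only on the recent window and a fixed multiplier, which is what makes them easy to operate and also why they carry no notion of seasonality or trend.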
However, their simplicity comes at a cost. Rule-based thresholding does not explicitly model the underlying structure of time series data—such as seasonality, daily traffic cycles, workload-driven patterns, or long-term trends. As a result, predictable and recurring fluctuations are often misclassified as anomalies. In dynamic systems like network environments, this leads to excessive false positives and alert fatigue.
Another fundamental challenge lies in selecting the right threshold. If the threshold is set too high, meaningful anomalies may be missed. If it is set too low, the system produces an excessive number of alerts, overwhelming teams with false positives. Striking the right balance is difficult, especially in dynamic environments where normal behavior continuously evolves.
To move beyond generic thresholds, we need to define what truly matters. That means providing representative examples of meaningful incidents and leveraging them to train machine learning (ML) models that can learn complex behavioral patterns. By incorporating context and historical patterns into the detection process, ML-based approaches can better distinguish between routine variability and genuine performance degradation—delivering more precise, reliable anomaly detection.
Anomaly Patterns on Ping Latency
For this study, we concentrated on ping latency, defined as the average round-trip time of packets between a test endpoint and an agent. Latency is a key indicator of network health: when it increases unexpectedly, it often signals congestion, routing issues, or service degradation.
In our framework, sudden increases in latency for a given <test, agent> pair are treated as potential network anomalies. These upward deviations are surfaced to downstream systems—such as event detection pipelines and alerting mechanisms—so they can trigger investigation or automated response.
Importantly, latency is a one-sided metric for anomaly detection: only abnormal increases matter. When latency rises beyond expected patterns, it is flagged as anomalous and requires action. Decreases, by contrast, are considered favorable outcomes, since lower latency reflects improved network performance.
We generated 14 days of ping latency time series data and analyzed the signal patterns it contains. Five patterns stood out:
- Spikes on steady signals: In this scenario, latency remains consistently stable, hovering around 70 ms with minimal variance. The signal exhibits a tight, well-defined baseline, making deviations easy to identify. When latency abruptly spikes to 545 ms, the change is both sudden and substantial relative to the historical pattern. Given the stability of the signal, this sharp increase clearly stands out as an anomaly and warrants immediate attention.
Figure 1: Spike anomalies on a steady signal
- Long-term anomaly: In this type of signal, the metric remains stable at around 70 ms before experiencing a sustained increase in latency. The elevated values persist for more than 10 hours before eventually returning to the normal baseline. For long-duration anomalies like this, it is important not only to detect the initial spike but also to capture the full span of the degradation.
Figure 2: Long-term anomaly lasting several hours
- Multi-modal signal: In this type, the signal oscillates between distinct value bands, repeatedly moving back and forth within these ranges. Some observations cluster around lower values (e.g., ~20 ms), while others rise to higher levels (e.g., ~110 ms). Although the variation may appear significant at first glance, all values fall within the expected operating baseline. The shifts between bands reflect normal system behavior rather than anomalous events, and therefore no anomaly is present in this signal.
Figure 3: No anomaly: values move back and forth among different bands
- Periodic signal: In this type, we observe recurring increases at roughly the same time each day, each lasting approximately two to four hours. Because these increases occur at a consistent, periodic cadence, they represent expected cyclical behavior rather than true anomalies. Such patterns are typically associated with predictable workload shifts or daily usage cycles and should be modeled as part of the system’s normal baseline.
Figure 4: No anomaly: values rise around the same time each day for a couple of hours
- High-variance signal: This category includes signals that exhibit large fluctuations without a consistent or recognizable pattern. The values frequently move up and down across a wide range, typically between 250 ms and 500 ms. Despite the magnitude of these swings, such behavior does not necessarily indicate an anomaly.
Figure 5: No anomaly: highly volatile signal with high variance
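For long-duration anomalies, capturing the full span of the degradation means turning per-point flags into intervals. A minimal sketch of that step is shown below; the function name `anomaly_spans` and the `max_gap` merging parameter are hypothetical and not part of the pipeline described in this article.

```python
def anomaly_spans(flags, max_gap=0):
    """Return inclusive (start, end) index pairs for runs of flagged points.

    Runs separated by at most `max_gap` unflagged points are merged, so a
    brief dip inside a long degradation does not split it into two events.
    """
    spans = []
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        if spans and i - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = i  # extend the current span
        else:
            spans.append([i, i])  # start a new span
    return [tuple(span) for span in spans]

# Hypothetical per-point anomaly flags
flags = [False, True, False, False, True, True, False, True, True, False]
strict_spans = anomaly_spans(flags)
merged_spans = anomaly_spans(flags, max_gap=1)
print(strict_spans)
print(merged_spans)
```

With `max_gap=0` the flags above yield three separate spans; allowing a one-point gap merges the last two into a single longer event.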
Dataset Preparation
After analyzing variance patterns in the time series data, we built a labeled dataset designed to capture a diverse range of both anomalous and normal behaviors. We generated 500 time series samples to provide representation across different signal patterns and variability types. Using Label Studio, domain experts manually annotated anomalies within each series, grounding the dataset in real operational context.
With the human-labeled dataset in place, we trained machine learning models and evaluated them on unseen data from different time periods. Each evaluation cycle revealed misclassified samples and previously underrepresented patterns that were not fully captured in the original dataset.
We then incorporated these misclassified examples back into the labeling workflow. Domain experts reviewed and annotated them, and the expanded dataset was used to retrain the model. This iterative, human-in-the-loop process continued—train, evaluate, refine—until the dataset and model performance reached a satisfactory level.
Machine Learning Models
To detect the anomalies, we have to look at historical data to understand what “normal” looks like. The size of the lookback window plays a critical role in shaping that understanding, as different windows surface different types of anomalies.
Shorter lookback windows are more sensitive to small, sudden deviations. In contrast, longer lookback windows help the model learn broader patterns in the signal—such as daily cycles, recurring peaks, or seasonality—providing a more stable baseline of expected behavior.
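Rather than relying on a single long window, we combine short- and long-term perspectives: the short window catches abrupt changes, while the long window provides the context needed to judge whether a change belongs to a recurring pattern. Used together, they make detection more robust and increase confidence that a flagged point is a genuine anomaly.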
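The difference between the two perspectives can be seen on a small synthetic example. The sketch below scores a new observation against a short and a long lookback window on an hourly signal with a recurring daily peak. The window lengths (6 and 48 hours), the 1 ms floor on the standard deviation, and all names are assumptions made for illustration; the article does not state the window sizes actually used.

```python
from statistics import mean, pstdev

SHORT, LONG = 6, 48  # lookback lengths in hourly samples (assumed values)

def upper_zscore(history, value, min_sigma=1.0):
    """One-sided z-score; min_sigma guards against a near-zero spread."""
    sigma = max(pstdev(history), min_sigma)
    return (value - mean(history)) / sigma

def expected_latency(hour_of_day):
    """Synthetic signal: 70 ms baseline with a 110 ms peak at hours 9-11."""
    return 110.0 if 9 <= hour_of_day <= 11 else 70.0

# Two full days plus nine hours of history; the next sample is day 3, hour 9
series = [expected_latency(h % 24) for h in range(24 * 2 + 9)]

def window_scores(history, value):
    """Score `value` against the short and the long lookback window."""
    return (upper_zscore(history[-SHORT:], value),
            upper_zscore(history[-LONG:], value))

# The usual daily peak arrives again
periodic_short, periodic_long = window_scores(series, 110.0)
# A genuine spike arrives instead
spike_short, spike_long = window_scores(series, 545.0)

print(round(periodic_short, 2), round(periodic_long, 2))
print(round(spike_short, 2), round(spike_long, 2))
```

The recurring peak looks extreme against the short window, which has only seen the flat baseline, but unremarkable against the long window, which has seen the same peak on previous days. The genuine spike scores high against both.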
From each lookback window, we engineered a set of statistical features designed to capture distributional shifts and structural changes in the signal. These features included rolling z-scores, normalized interquartile differences, slopes between the observed value and various percentile thresholds, lags, the quartile coefficient of dispersion, as well as measures of asymmetry and tail behavior. Together, these features provided a compact but expressive summary of how the current observation deviates from its recent context.
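To make the feature families above concrete, here is a minimal sketch of how they could be computed for one lookback window. The exact formulas used in the study are not given, so the definitions below (in particular `iqr_diff` and the `slope_*` features) are assumed, plausible readings, and every name is hypothetical.

```python
from statistics import mean, pstdev, quantiles

def window_features(history, value, eps=1e-9):
    """Summarize how `value` deviates from the points in one lookback window."""
    mu, sigma = mean(history), pstdev(history)
    q1, q2, q3 = quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1

    # Central moments for asymmetry (skewness) and tail behavior (kurtosis)
    centered = [x - mu for x in history]
    m2 = mean([c ** 2 for c in centered])
    m3 = mean([c ** 3 for c in centered])
    m4 = mean([c ** 4 for c in centered])

    return {
        # Rolling z-score of the current observation
        "zscore": (value - mu) / (sigma + eps),
        # Distance above Q3 in units of IQR (assumed definition)
        "iqr_diff": (value - q3) / (iqr + eps),
        # Relative change from a percentile to the observation (assumed definition)
        "slope_p50": (value - q2) / (abs(q2) + eps),
        "slope_p75": (value - q3) / (abs(q3) + eps),
        # Most recent raw values
        "lag_1": history[-1],
        "lag_2": history[-2],
        # Quartile coefficient of dispersion
        "qcd": (q3 - q1) / (q3 + q1 + eps),
        "skewness": m3 / (m2 ** 1.5) if m2 > 0 else 0.0,
        "excess_kurtosis": m4 / (m2 ** 2) - 3.0 if m2 > 0 else 0.0,
    }

# Hypothetical steady window (ms) followed by a 545 ms observation
history = [70, 71, 69, 70, 72, 70, 69, 71, 70, 70, 71, 69]
features = window_features(history, 545)
for name, val in features.items():
    print(f"{name}: {val:.4f}")
```

Running the same function over each lookback window and concatenating the results would give one feature vector per observation, which is the shape of input that tree-based models expect.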
We evaluated several machine learning models, including Logistic Regression, XGBoost, Random Forest, LSTM networks, and CNN-based architectures. Among them, tree-based algorithms like Random Forest and XGBoost consistently outperformed the other techniques on the validation dataset, delivering the best balance of precision and recall. Feature importance analysis revealed that rolling z-score–based features and slope-based features were the strongest contributors to model performance.
As with most anomaly detection systems, there is a natural trade-off between recall and precision.
Simpler approaches—such as Logistic Regression or a z-score threshold set at one standard deviation—tend to achieve very high recall. They successfully capture nearly all labeled anomalies. However, this sensitivity comes at a cost: they flag many normal points as anomalous, leading to a high false positive rate and, consequently, low precision.
In contrast, ensemble-based models like Random Forest and XGBoost maintain strong recall while significantly reducing false positives. By better capturing nonlinear relationships and contextual patterns in the data, these models achieve higher precision and improved F1-scores. The result is a more balanced and operationally useful anomaly detection system—one that surfaces meaningful issues without overwhelming downstream alerting pipelines.
Z-score (2 std)
| Type | Precision | Recall | F-Score |
|---|---|---|---|
| Spike | .06 | 1.00 | .11 |
| Multi-modal | .43 | .78 | .55 |
| Periodic | .45 | .62 | .52 |
| Long Duration | .22 | .54 | .32 |
| High Variance | .22 | .96 | .36 |
Logistic Regression
| Type | Precision | Recall | F-Score |
|---|---|---|---|
| Spike | .14 | 1.00 | .25 |
| Multi-modal | .09 | 1.00 | .17 |
| Periodic | .12 | 1.00 | .21 |
| Long Duration | .07 | 1.00 | .14 |
| High Variance | .07 | 1.00 | .13 |
Random Forest
| Type | Precision | Recall | F-Score |
|---|---|---|---|
| Spike | .46 | .87 | .60 |
| Multi-modal | .31 | .94 | .47 |
| Periodic | .37 | .83 | .51 |
| Long Duration | .35 | .93 | .50 |
| High Variance | .36 | .86 | .51 |
XGBoost
| Type | Precision | Recall | F-Score |
|---|---|---|---|
| Spike | .50 | 1.00 | .66 |
| Multi-modal | .23 | .97 | .38 |
| Periodic | .27 | .88 | .41 |
| Long Duration | .31 | 1.00 | .48 |
| High Variance | .27 | .87 | .42 |
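For readers who want to reproduce the metrics, the sketch below computes precision and recall from point-wise labels and combines them into an F-score. It assumes the F-Score column is the F1 score (the harmonic mean of precision and recall), which the article does not state explicitly; the function names and the toy labels are illustrative.

```python
def precision_recall(y_true, y_pred):
    """Point-wise precision and recall for binary anomaly labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy labels: every true anomaly is caught, but with extra false alarms
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
toy_precision, toy_recall = precision_recall(y_true, y_pred)
print(toy_precision, toy_recall, round(f_score(toy_precision, toy_recall), 2))

# Recomputing two table rows from their reported precision and recall
zscore_spike_f = round(f_score(0.06, 1.0), 2)   # Z-score, Spike row
forest_spike_f = round(f_score(0.46, 0.87), 2)  # Random Forest, Spike row
print(zscore_spike_f, forest_spike_f)
```

Because the harmonic mean is dominated by the smaller of the two values, perfect recall cannot compensate for very low precision, which is why the high-recall rows still have low F-scores. Small differences from the tabulated values are expected, since the precision and recall shown are themselves rounded.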
Time Series Foundational Models on Anomaly Detection
Time Series Foundational Models (TSFMs) are pretrained on large and diverse datasets spanning multiple domains. By learning from heterogeneous signals—ranging from infrastructure metrics to business and environmental data—they develop generalized representations of temporal behavior. This enables knowledge transfer across domains and notably reduces the need for manual feature engineering. In practice, a raw time series can be fed directly into the model, which then forecasts the next n time steps.
Chronos-2 [6] supports zero-shot univariate and multivariate forecasting, allowing predictions without task-specific retraining. In our setup, we performed one-hour-ahead forecasting using 48 hours of historical context. The model generated probabilistic forecasts at three quantile levels: the 25th, 50th, and 75th percentiles (p25, p50, and p75).
To quantify forecast uncertainty, we computed the interquartile range (IQR) from the predicted quantiles:
Chronos_IQR = Forecasted 75th Quantile − Forecasted 25th Quantile
Anomaly Decision Rule: observed value > p75 + 2 × Chronos_IQR
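The decision rule can be sketched in a few lines once the forecast quantiles are available. The example below takes the p25 and p75 forecasts as plain lists and deliberately does not call the Chronos API; the function name is hypothetical, and the numbers are made-up values chosen for illustration, not model outputs.

```python
def chronos_iqr_flags(observed, p25, p75, k=2.0):
    """Flag observations above p75 + k * (p75 - p25) at each forecast step."""
    flags = []
    for value, low, high in zip(observed, p25, p75):
        forecast_iqr = high - low          # Chronos_IQR for this step
        threshold = high + k * forecast_iqr
        flags.append(value > threshold)
    return flags

# Illustrative values in ms (not real forecasts)
observed = [72.0, 75.0, 140.0, 71.0]
p25 = [68.0, 68.0, 69.0, 70.0]
p75 = [73.0, 74.0, 75.0, 90.0]

flags = chronos_iqr_flags(observed, p25, p75)
print(flags)
```

Only the third observation exceeds its threshold (75 + 2 × 6 = 87 ms). The rule is one-sided, so values below the forecast range are never flagged, consistent with treating latency decreases as favorable.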
While this approach captures short-term deviations effectively, we observed that Chronos can be sensitive to anomalies within the input history. Large anomaly segments tend to influence subsequent forecasts, elevating predicted values in the post-anomaly period. As a result, prolonged or sustained anomalies may be partially absorbed into the forecast baseline, causing the model to miss longer-term degradations. This behavior stems from the model’s objective of minimizing prediction error over time, which implicitly encourages it to absorb persistent deviations into its learned dynamics rather than maintaining them as outliers. As the forecasting horizon extends, the model also broadens its predicted quantile ranges, increasing uncertainty bands and further masking anomalies by making extreme values more likely to fall within expected bounds.
Sample Outputs
Conclusion
By moving beyond traditional statistical thresholding and adopting Machine Learning (ML) techniques, we enabled the system to learn what truly constitutes an anomaly. Rather than relying solely on fixed statistical deviations, the models were trained on examples that reflected meaningful anomalous behavior in our environment. As a result, we reduced the noise and false positives commonly produced by statistical threshold-based methods, allowing the system to focus on anomalies that matter most to users.
References
[1] ThousandEyes Event Detection Platform, https://docs.thousandeyes.com/product-documentation/event-detection
[2] ThousandEyes Dynamic Baselines, https://docs.thousandeyes.com/product-documentation/alerts/creating-and-editing-alert-rules/dynamic-baselines
[3] Chase Anomaly Detection, https://medium.com/next-at-chase/cutting-time-to-detect-customer-impact-with-z-score-anomaly-detection-8dd03cbd9227
[4] Booking’s Statistical Anomaly Detection System, https://medium.com/booking-com-development/anomaly-detection-in-time-series-using-statistical-analysis-cc587b21d008
[5] Detecting Network Anomalies on Latency, https://medium.com/thousandeyes-engineering/detecting-anomalies-in-network-latency-time-series-from-statistical-filters-to-machine-learning-b8098dc61109
[6] Amazon Chronos Foundational Model, https://github.com/amazon-science/chronos-forecasting