ThousandEyes is a network monitoring platform that provides deep insights and visibility into applications and networks in the cloud. Global Fortune 500 companies rely on ThousandEyes to monitor their business critical services. It’s absolutely mandatory for us to adhere to a high level of service availability so our customers can be successful. We monitor our own physical infrastructure and software platform to make it possible for the service to operate 24x7, all year long. To serve more customers and higher network growth rates while following our vision, we realized that we needed alternatives to traditional infrastructure monitoring practices utilizing centralized monitoring through Nagios and check based alerts. After going through several discussions in order to fully understand key requirements, we decided to build an infrastructure monitoring platform for our engineering department.
In this blog post, we will go through the history of internal infrastructure monitoring at ThousandEyes. We will see how it evolved from a single Nagios instance to an infrastructure monitoring platform in itself, hosting multiple internal services. The infrastructure is designed to be highly available, enabling each data center to operate independently in the event of an outage. To serve future growth, we built it with scalability in mind. We’ll share our experience and insights in designing and running it and our vision for the future.
Legacy Monitoring Architecture
When I joined the ThousandEyes operations team in 2014, we were running a single instance of Nagios with check_mk. With this setup, we distributed checks to all our servers using our configurations management system, instead of configuring them centrally on nagios. This worked out better than a vanilla Nagios setup. But had its limitations.
At the time, we were just getting started with metrics for our systems and applications. We had high availability (HA) setups for the two popular time series databases Influxdb and Graphite. The former was being used for server (cpu, memory, disk) and service (mysql, nginx, mongodb.) metrics collected by Telegraf and the latter was gaining popularity among developers for pushing application metrics through statsd.
Our legacy infrastructure monitoring architecture is depicted in Figure 1. We were using Nagios for custom metrics and alerting, Influxdb for system and service metrics and Graphite for application metrics. We quickly realized that this wasn’t the most efficient choice. Why? Well, because:
- There was no easy way to generate alerts from the metrics gathered. Nagios was the tool for evaluating alert rules but we had metrics in two different data stores.
- Developers rely on SREs to write and deploy alerts. This added an unnecessary step and slowed down productivity for both teams.
- Lack of Host/Service discovery. Targets had to be manually added to Nagios.
We learned two things during this time: a) we love metrics and b) we could use metrics in much more creative ways to gain insights into our infrastructure. This served as the initial spark and motivation to invest time into building a resilient infrastructure monitoring platform.
We decided to address the limitations of our legacy infrastructure monitoring with an internally designed infrastructure monitoring platform we call Insights. We wanted to empower teams with automation capabilities to easily add metrics for existing and new services, write alerts for their infrastructure and route them wherever they wish. The Site Reliability Engineering (SRE) team would only need to be responsible for the management of the Insights platform to ensure the highest level of availability for our infrastructure.
Designing for Scale and Availability
The core vision for the new infrastructure monitoring platform was to be able to monitor multiple data center environments as we scaled in the future. The logical step was to move the monitoring and alerting components out of the current data center (DC1) to a cloud platform (CP). The CP would be connected to each data center securely via VPN, collect metrics and monitor all components. It would be away from the blast radius of an outage, in the unlikely event of a DC going down and would be a great way to correlate and collect data for long-term analysis. While this made sense it presented some interesting challenges. For example:
- How do you continue getting metrics and alerts for each DC when the CP has an outage.
- How do you continue getting metrics and alerts for each DC when VPN connectivity to CP goes down.
- And of course, how do you get alerts and, if possible, metrics for the Insights platform itself when CP goes down.
The design could be considered analogous to a distributed data store. From the perspective of CAP theorem, we chose the availability of services and ability to function in the event of a network partition over consistency. It was acceptable if the Insights platform wasn’t always aware of the global state (through metrics and alerts). For instance, it was deemed acceptable if the Insights platform couldn’t collect metrics for DC1 in the event of an outage, as long as they continue to be available locally at DC1 itself. We chose local visibility and availability over consistency in our central platform. We addressed this from the very beginning by designing each DC’s monitoring to be self-reliant. Every DC would run a common, stand-alone environment for metrics collection, processing, alerting and charting.
Prometheus, a popular alerting and monitoring solution, sits at the core of Insights design. In addition to its integrated alerting capability, we liked its notion of target discovery. And as we started the project, it was the only solution with out-of-the-box support for Kubernetes. It is easy to manage, has a powerful query language and due to its flexible target discovery, we were able to set it up to collect metrics from most of our infrastructure within a day. Needless to say, we were impressed.
Prometheus worked well but only supported vertical scaling. Their recommended architecture is to have different instances for different segments of infrastructure. This resonated with our initial thought of separating components for each service team and hence we decided to split Prometheus per engineering team. We didn’t just stop there. We configured Prometheus instances to do smart discovery for—containers, VMs, physical hosts etc.—as well as every piece of infrastructure owned by the team. We covered basic alerts, but teams would be responsible for writing alerts specific to their services. With this model, we made all the engineering teams independent.
Prometheus has out of the box support for various infrastructure platforms like EC2, GCP or OpenStack. This works well for platforms that leverage them but not so well for many small to midsize organizations which are mostly operating bare-metal servers and a few Virtual Machines (VMs). Incidentally, this was the case with us. We managed our server fleet using puppet and family. Due to the relatively static nature of the fleet, a robust discovery platform wasn’t needed. We decided to leverage our existing tools and automate this using puppetdb and Prometheus’ file based discovery.
We created a custom fact for distinguishing nodes for different teams and through puppetdb’s REST API, we were able to query respective nodes in a single curl command. This runs as a cron job and new nodes are discovered every 15 minutes.
We did miss the lack of a scalable Time Series DataBase (TSDB) with Prometheus. Back then Prometheus supported Influxdb as an output endpoint. We tested this but the data model for metrics in Prometheus wasn’t optimized for Influxdb. It resulted in high cardinality metric series and Influxdb instances would balloon up due to memory only indices and die periodically. We chose to retain only a month of data in Prometheus and address long-term metrics storage in the future.
To centrally collect metrics, for each Prometheus instance running locally in a data center, we ran a mirrored instance in the cloud environment—CMP. The mirrored instances are simply configured to federate all relevant data from their local counterparts. This federated interval matches the minimum scrape interval configured on the target instance (generally 30 seconds) to avoid any data loss.
This is a very rudimentary approach for collecting metrics centrally and causes gaps in data during VPN or Prometheus outages. In order to avoid this, we are migrating to a scalable TSDB using Prometheus’ remote read/write endpoint and will share our experience with this approach soon.
Telegraf is our primary metrics collector. We came across it while testing Influxdb and loved its pluggable architecture and the support community. In addition to continue using it as part of our new design, we decided to invest our time and effort into its development and contribute back. Currently, Telegraf runs as de-facto metrics collector across our infrastructure.
Alertmanager (AM) is part of Prometheus family and is a generic alert management utility. It provides alert aggregation, complex alert routing and silencing capabilities. Currently it routes alerts to PagerDuty and team channels on Slack. We use it as a generic alert routing service in our infrastructure. This enables non-Prometheus services to also dispatch alerts through AM. For instance, we use Burrow for monitoring Kafka consumer lags and through its webhook based notifications, its configured to dispatch lag alerts to AM. This allows existing monitoring utilities to be integrated with AM without using Prometheus everywhere. Alerts from non-Prometheus services are also routed to AM.
Prometheus instances local to a data center are configured to dispatch alerts to both a local AM and the one running in the CMP. AMs are configured to synchronize their state, thus handling deduplication of alerts and silences.
Highly Available Alerting
Each Prometheus instance dispatches alerts to AM running locally and in CMP.
In the event of a local AM failure, the alert would be routed through a remote AM. Since Prometheus monitors AM, one of the alerts would be local AM being down.
We use Grafana for visualizing metrics. It is fast, capable of tapping into multiple data stores and has a great community. We used it in our legacy monitoring too and absolutely loved it.
To simplify switching across multiple Prometheus instances, we use data sources as variables in our dashboards using Grafana’s templating features. This enables us to deduplicate efforts across teams by using same dashboards for common components. For instance, if multiple teams run their own Elasticsearch clusters, they could use the same dashboard by just selecting their team’s Prometheus instance as the data source.
We also use wizzy to version control and synchronize dashboards and data sources across all Grafana installations. Grafana hosted at the CMP is most often used by engineers. In case it is inaccessible, one can simply switch to the Grafana instance in another location.
We host Grafana at each data center. Grafana hosted at the Insights cloud environment is most often used by engineers. In case it is inaccessible, one can simply switch to a Grafana instance in another location. Multiple hosting introduces interesting challenges around authentication, authorization, data sources and dashboard synchronization.
We use Grafana’s auth.proxy and LDAP to authenticate and authorize a user. We use wizzy to version control and synchronize dashboards and data sources across all Grafana installations. Through this, we can guarantee the same Grafana setup across all installations.
Grafana hosted in the Insights setup serves another important role of monitoring the local AM and can alert when its down. This was important to guarantee an alert in the event of multiple AM failures.
What does the future hold?
The Insights infrastructure has served us very well so far and is under active development. It enabled us to grow our own infrastructure monitoring. There are still some shortcomings, challenges and frontiers we have on our punch list and we will briefly go over some of the areas we are working on.
The current design of mirrored Prometheus instances has few drawbacks. Due to the increasing number of metrics, the federated scrape size and times are gradually going up. An overloaded mirrored instance could easily skip a scrape or timeout, causing a gap in metrics. This gap gets wider during connectivity issues. Due to our architecture, local metrics and alerting continue to function even during such events and only impacts long-term retention.
We plan to use the read/write endpoint of Prometheus 2.x to store metrics in a horizontally scalable data store. We are currently looking into Elasticsearch and Uber’s M3DB. We have extensive experience and tooling around Elasticsearch, and M3DB is a promising new TSDB that we are excited to explore. To improve central collection, the metrics will be queued locally in each DC before pushing them to the TSDB. This way they can be queued when connectivity to Insight is lost and can be consumed as soon as its restored. This resolves the metrics gap issue. We are also excited to explore Thanos. It has an interesting approach to tackle the same issue.
We currently store only 2 months of metrics and want to retain at least a year’s worth of them and make them readily available. We need to downsample the data otherwise the data volume would be too high. Granular metrics may not be needed for long-term storage and others may not be eligible (percentiles), but we need to design for handling this type of selective aggregation.
With multiple data centers, we also want each DC to perform basic monitoring of each other. Such a mesh-based architecture would enable us to detect multi-DC outages without solely relying on Insights Infrastructure.
Insights has helped us make significant leaps into gaining visibility into our own infrastructure. We have learnt a lot of valuable lessons through this process and are excited to tackle the newer challenges. Subscribe to our blog if you are interested in being a part of our Insights journey and learning more. Also, we are hiring!