We recently invested a significant amount of work in re-architecting and designing a new metrics platform for the IX Australia fabric. We’d like to show you where we’ve come from, the technology stack we’ve chosen and what is to come.
Existing Technology Stack
For the past five years, we have used Cacti to poll network devices at five-minute intervals and generate averages for per-interface input and output traffic. Each IX had a completely separate Cacti instance, which ensured that network segregation would not result in loss of real-time data but limited aggregation and analysis across IXs.
A custom script pulled route metrics from our Bird route servers to track per-peer accepted and filtered routes as well as totals.
Cacti stores data in RRD files – a well known and time-tested means of storing time series data on disk. Unfortunately, RRDs come with their own limitations:
- RRDs keep data for a given period, then aggregate that data into larger averages for long-term storage. High-resolution data is discarded, which reduces historical accuracy.
- Aggregating data over a large number of interfaces is cumbersome and requires manual edits to RRD rules (e.g. per-IX aggregate traffic volume graphs).
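To illustrate the first limitation, here is a minimal Python sketch of what averaging roll-ups do to a short traffic burst. The sample values and the `consolidate` helper are hypothetical and only mimic the general idea of an averaging consolidation; they are not taken from our Cacti configuration.

```python
from statistics import mean


def consolidate(samples, factor):
    """Average consecutive groups of `factor` samples into one value each."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical five-minute traffic samples in Mbit/s, with one short burst.
five_minute = [100, 110, 105, 900, 120, 100, 95, 105, 110, 100, 100, 115]

# Twelve five-minute samples rolled up into a single hourly average.
hourly = consolidate(five_minute, 12)

print(max(five_minute))  # the burst is visible at full resolution
print(hourly)            # one averaged value; the burst can no longer be seen
```

Once only the hourly value is kept, there is no way to tell a steady hour from one with a brief spike.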
To overcome these limitations and expand our visibility, we designed and implemented an expanded system for collecting, storing and analysing high-resolution metrics.
The key goal of this project was to provide network insights to the IX Australia development team and to our members, so that our process of continual improvement can keep up with our members’ needs. We wanted to collect and store the highest resolution of information possible in a scalable and high-performance way without adding operational overheads. It was crucial that our technology stack be automated so that repeatable and consistent results are achieved across the board. Lastly, it was imperative that whatever infrastructure we built have a high level of resilience when faced with partial or complete network segregation – monitoring data should be treated like any other production information and designed accordingly.
New Technology Stack
For time-series storage, we chose InfluxDB due to its relative maturity amongst the recent wave of tag-based time-series storage engines. InfluxDB is an Open Source product developed by InfluxData. It allows for fast metric queries based on arbitrary tags using an SQL-like syntax, and also includes an array of mathematical and statistical functions for data analysis.
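As a rough illustration of what a tag-based query looks like, the Python sketch below builds an InfluxQL statement that averages a field, filters on tag values and groups by time and tag. The measurement, field and tag names (`interface_traffic`, `in_bps`, `ix`, `port`) are made up for the example and do not reflect our actual schema. Values are not escaped, so this should not be fed untrusted input.

```python
def traffic_query(measurement, field, tags, window="1h", interval="5m", group_by=()):
    """Build an InfluxQL query averaging `field`, filtered by exact tag matches."""
    # Identifiers are double-quoted and string values single-quoted in InfluxQL.
    conditions = [f"\"{key}\" = '{value}'" for key, value in sorted(tags.items())]
    conditions.append(f"time > now() - {window}")
    where_clause = " AND ".join(conditions)

    groups = [f"time({interval})"] + [f'"{tag}"' for tag in group_by]
    group_clause = ", ".join(groups)

    return (
        f'SELECT mean("{field}") FROM "{measurement}" '
        f"WHERE {where_clause} GROUP BY {group_clause}"
    )


query = traffic_query(
    "interface_traffic",          # hypothetical measurement name
    "in_bps",                     # hypothetical field name
    {"ix": "example-ix"},         # hypothetical tag and value
    group_by=("port",),
)
print(query)
```

Because the filter and grouping operate on tags, the same query shape works for a single port, a whole IX or every IX at once.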
To ensure resiliency, each collector has a local InfluxDB-relay instance that acts as a local buffer and automatically re-attempts metric delivery in the event that the central metric store is unavailable – this is very useful for software updates and security patches requiring reboots and service restarts of the central store. The relay service can also be configured to deliver metrics to multiple backends in an HA configuration.
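The buffering behaviour can be pictured with a small Python sketch. This is not how InfluxDB-relay is implemented; the `BufferingRelay` class, its `send` callable and the buffer size are all hypothetical, and the sketch only shows the general pattern of holding metrics locally and delivering them in order once the backend is reachable again.

```python
from collections import deque


class BufferingRelay:
    """Hold metric lines locally and retry delivery until the backend accepts them."""

    def __init__(self, send, max_buffered=10000):
        self._send = send  # callable that raises ConnectionError on failure
        self._pending = deque(maxlen=max_buffered)  # oldest lines drop first when full

    def write(self, line):
        """Queue a line, then try to deliver everything that is waiting."""
        self._pending.append(line)
        self.flush()

    def flush(self):
        """Deliver buffered lines in order; stop at the first failure and keep the rest."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False
            self._pending.popleft()
        return True

    @property
    def buffered(self):
        return len(self._pending)


# Simulated central store that starts out unavailable.
backend = {"up": False}
received = []


def send_to_central_store(line):
    if not backend["up"]:
        raise ConnectionError("central store unavailable")
    received.append(line)


relay = BufferingRelay(send_to_central_store)
relay.write("example value=1i")
relay.write("example value=2i")
print(relay.buffered)  # both lines are held while the store is down

backend["up"] = True
relay.flush()
print(received)        # lines arrive in their original order
```

A bounded buffer is a deliberate trade-off in this sketch: during a very long outage the oldest points are dropped rather than exhausting memory on the collector.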
Our thanks go to the contributors of Snmpcollector, a fantastic Golang-based Open Source project. It consists of a single Golang binary that provides a configurable, high-performance SNMP poller that pushes results to InfluxDB. Most importantly, the project treats its RESTful API as a first-class citizen, making it ideal for automation workflows.
Telegraf is another Open Source product developed by InfluxData, with a modular, plugin-based model for collecting host metrics and reporting them to InfluxDB. There is a wide range of plugins to support most applications, with many more pending as community pull requests. Telegraf also has a built-in buffering and retry mechanism, so metrics are buffered if connectivity to InfluxDB is lost. Telegraf is installed on every server with an appropriate list of plugins enabled. Using the “exec” plugin and the Bird-Tools binary, custom per-peer route server metrics are reported in InfluxDB Line Protocol.
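For readers unfamiliar with Line Protocol, the following Python sketch shows the kind of output a script run by the “exec” plugin might print. It is a stand-in for illustration only and is not the Bird-Tools binary; the measurement name `bird_routes`, the tag and field names, and the sample values are assumptions. It also assumes the exec input is configured with `data_format = "influx"` so that Telegraf parses the printed lines.

```python
def escape(value):
    """Backslash-escape commas, equals signs and spaces in tag and field keys/values."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value,... field=valuei,... timestamp."""
    tag_part = ",".join(f"{escape(k)}={escape(v)}" for k, v in sorted(tags.items()))
    # Integer field values carry an "i" suffix in line protocol.
    field_part = ",".join(f"{escape(k)}={int(v)}i" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


# Hypothetical per-peer route counts; 64496 is an ASN reserved for documentation.
line = to_line(
    "bird_routes",
    {"ix": "example-ix", "peer_asn": "64496"},
    {"accepted": 120, "filtered": 3},
    1500000000000000000,
)
print(line)
# bird_routes,ix=example-ix,peer_asn=64496 accepted=120i,filtered=3i 1500000000000000000
```

Putting the peer and IX in tags rather than in the measurement name is what makes per-peer and per-IX filtering possible later without changing the collector.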
IX Australia already utilises Puppet for configuration automation on our systems, so it was a natural choice to extend to this metrics solution. SNMPCollector required a custom module to be developed, as did InfluxDB-Relay. All configuration for metrics, devices, tagging and applications is maintained in Hiera YAML, and Puppet ensures every SNMPCollector instance is consistent.
Previously, each IX had an independent monitoring stack, manually configured to collect metrics and display basic per-port graphs and per-ASN route statistics. The new architecture is much more distributed, with an Ubuntu VM for each IX acting as a collector and reporting metrics back to a central store. Buffering mechanisms have been implemented to prevent the loss of metrics during potential outages – metrics are kept on the collector until connectivity is restored.
This design allows for metric aggregation and analysis across IXs rather than having independent graphing instances. We can now more easily see trends across IXs and display all of a peer’s port and route information on one page.
We have made the three dashboards displayed in this blog public to all members at https://metrics.ix.asn.au. We’re very interested in your feedback, which you can send to email@example.com.
Going forward, we want to provide better metrics and insights for members by integrating this stack with the members portal. Draft future improvements include:
- Optic transceiver statistics
- Interface state change alerts
- Detailed traffic stats (max, min, volume) per interface
- Route churn statistics
- Detailed filtered route lists
We intend to release future blogs with some of the most interesting insights we’ve gained as well as future feature releases. Please add https://www.internet.asn.au/category/blog/ to your favourites list and check in regularly!