Tech News – Metrics, upgrades and hardware moves: a busy few months!

From rolling out a brand-new metrics system to tackling firmware bugs and optimising hardware across the network, the past few months have been anything but quiet. We’ve been refining how we collect and display network data, upgrading devices to keep things running smoothly, and consolidating hardware to future-proof our infrastructure. Here’s a look at what’s been happening behind the scenes.

New metrics system is a work in progress (but a big step forward!)

Our old metrics system served us well for years—but let’s be honest, it wasn’t keeping up with our evolving config generation. That led to issues like peer metrics failing to index the right ports, along with some security “quirks” we won’t elaborate on.

Former Technical Team Lead Nick Pratley spearheaded a search for a better way to collect network metrics, and the result is here – check it out! 

The new stack combines SNMP Exporter (with custom enrichment via our portal), Prometheus, VictoriaMetrics, and Grafana. The system is still a work in progress, so we'd love your feedback. Please let us know if you spot anything off!

For those wondering, historical metrics from the old system aren’t disappearing; we just have a few final touches to bring them into the new platform.

Side note: If you saw some wildly unstable graphs on 31 January 2025 (including a rather scandalous spike to 2.59Tb/s; sadly, not real… yet), that was due to a newly added SNMP Exporter module. The way VictoriaMetrics handles Prometheus scrapes caused inconsistent timestamps, which threw off rate calculations. A quick tweak to the honor_timestamps scrape setting fixed it, but not before a recalibration spike. Did we mention this is still a work in progress?
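
For the curious, here is a rough illustration of why inconsistent timestamps produce phantom spikes. It is a simplified sketch, not how VictoriaMetrics or Prometheus actually compute rates: the rate_bps helper, the 30-second scrape interval, and the 10Gb/s traffic figure are all made up for the example. A traffic rate is the change in an interface's octet counter divided by the time between two samples, so if a sample is stamped at the wrong time, the same counter values yield a very different rate.

```python
def rate_bps(samples):
    """Per-interval bit rate from (timestamp_seconds, octet_counter) samples.

    Hypothetical helper for illustration only: real time-series databases
    use more involved rate functions (counter resets, extrapolation, etc.).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # Octets to bits, divided by the elapsed time between the two samples.
        rates.append((c1 - c0) * 8 / (t1 - t0))
    return rates


# Assumed scenario: a port carrying a steady 10 Gb/s, scraped every 30 s.
BYTES_PER_SEC = 1_250_000_000  # 10 Gb/s expressed in bytes per second
true_times = [0, 30, 60, 90]
counters = [t * BYTES_PER_SEC for t in true_times]

# Samples stamped at the time the counters were actually read.
consistent = list(zip(true_times, counters))

# Same counter values, but the third sample is stamped 29 s too early.
skewed = list(zip([0, 30, 31, 90], counters))

print(rate_bps(consistent))  # steady rate in every interval
print(rate_bps(skewed))      # one huge spike, then a dip to compensate
```

In the skewed case, 30 seconds' worth of traffic appears to arrive in a single second, so that interval looks 30 times busier than it really was, and the following interval looks quieter to make up for it.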

Firmware upgrades: when a cold spare saves the day!

In late 2024, a device at Equinix PE2 ran into trouble: its management network interface reported a ‘Tx Unit Hang’, reset itself… and then never came back. Before we could fully diagnose it with the vendor, the device went completely unresponsive. No lights, no console, nothing. Our Perth-based PHP developer Kyle stepped up and swapped in our cold spare (cheers, Kyle!), bringing everything back to normal.

Then, just before Christmas, a device at NextDC P1 rebooted unexpectedly. Given our recent failure, this was not a welcome surprise. It turned out that excessive SNMP instances had triggered a bug that crashed the device. We quickly dialled back our SNMP pollers, and after confirming a software fix, we rolled out a firmware upgrade across the fleet.

Alongside these upgrades, we regenerated device configs to align them with the portal. For some Members, this means previously unshaped VLL services are now correctly shaped, so if you’ve been enjoying a free ride, sorry, that’s over. A postmortem was sent out for each outage. If you notice loss on your VLL after a firmware update, the service may now be shaped to its configured speed, and you may need to adjust your VLL speed.

Hardware consolidations: NSW-IX gets future-ready

NSW-IX has been undergoing maintenance to consolidate hardware. With our modern Arista 400Gbps devices supporting more 100G LR1 Members, we’re able to retire some older 100Gbps switches, redeploy them where they’re needed most, and still maintain 100Gbps switches for 10Gbps access ports. Bottom line: NSW-IX is well-positioned for future capacity demands, with ample 10/100/400Gbps availability.

Looking ahead, this hardware consolidation will allow for rapid deployment at potential new sites like NextDC S2 and NextDC M2. But more immediately, it enables 100Gbps Member ports at SA-IX and VDC-PER01 at WA-IX—just in time for the upcoming QV1 farewell. Bigger ports? More content distribution via AS10084? Stay tuned!
