On 8 October, we received alerts about IPv6 peers flapping toward several route servers across both IAA and NZIX. Investigation revealed that peers were receiving a malformed BGP update, which we traced to our route server BGP daemon software (Bird v2.0.7). The issue stemmed from Bird propagating an attribute it didn’t support, and it wasn’t just us. Other exchanges including JINX, DINX, CINX, THINX, MegaIXs, LONAP, GetaFIX, PIT-IX, and later EdgeIX were affected too.
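The culprit? The way BGP handles unknown attributes. When a router doesn’t recognise an optional attribute and the transitive bit is set, it passes that attribute on to its peers, and the revised error handling in RFC 7606 (ironic, right?) can only validate attributes the router actually understands. Our friends at BGP Tools provided an excellent breakdown of how this can occur: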
“If a BGP implementation does not understand an attribute, and the transitive bit is set, it will copy it to another router.”
Source: Benjojo’s Blog – BGP Path Attributes and Grave Error Handling
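To make the quoted behaviour concrete, here is a minimal Python sketch of how a BGP speaker might decide what to do with a path attribute it doesn’t recognise. The flag bit values follow the attribute flags octet defined in RFC 4271; the names `PathAttribute`, `parse_attribute`, `classify` and `forwarded_flags` are hypothetical helpers made up for this illustration and are not taken from Bird or any other implementation.

```python
import struct
from dataclasses import dataclass

# Attribute flag bits from the BGP path attribute flags octet (RFC 4271).
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_PARTIAL = 0x20
FLAG_EXTENDED_LENGTH = 0x10


@dataclass
class PathAttribute:
    flags: int
    type_code: int
    value: bytes


def parse_attribute(buf: bytes, offset: int = 0):
    """Parse one path attribute; return it and the offset of the next one."""
    flags, type_code = struct.unpack_from("!BB", buf, offset)
    offset += 2
    if flags & FLAG_EXTENDED_LENGTH:
        (length,) = struct.unpack_from("!H", buf, offset)  # 2-byte length
        offset += 2
    else:
        (length,) = struct.unpack_from("!B", buf, offset)  # 1-byte length
        offset += 1
    value = buf[offset:offset + length]
    if len(value) != length:
        raise ValueError("attribute is truncated")
    return PathAttribute(flags, type_code, value), offset + length


def classify(attr: PathAttribute, known_types: set) -> str:
    """Decide what to do with an attribute (hypothetical action names)."""
    if attr.type_code in known_types:
        return "process"      # understood: can be validated properly
    if not attr.flags & FLAG_OPTIONAL:
        return "error"        # unrecognised well-known attribute
    if attr.flags & FLAG_TRANSITIVE:
        return "propagate"    # not understood, but passed on anyway
    return "ignore"           # optional non-transitive: quietly dropped


def forwarded_flags(attr: PathAttribute) -> int:
    """Flags used when passing on an unrecognised transitive attribute."""
    return attr.flags | FLAG_PARTIAL
```

The "propagate" branch is the one that matters here: a speaker that doesn’t know the attribute type has no way to check whether the contents are sane, so whatever it received is handed to the next router.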
The issue was initially filtered upstream by the offending peer, which stopped the immediate problem. However, since we remained vulnerable to a recurrence, we rolled out the latest Bird code to our lab environment. Compatibility testing with our route server config generator (arouteserver) showed no issues, so we scheduled maintenance windows with provisions to escalate to emergency maintenance if the fault reappeared.
Upgrades went live on Route Server 2 for IAA on 20 October and for NZIX on 23 October. And sure enough, the issue re-emerged across exchanges that same day, prompting an emergency upgrade of Route Server 1. We’re now running Bird 2.17.2, which includes support for RFC 9234, allowing it to drop a malformed Only-to-Customer (OTC) attribute instead of propagating it.
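As a rough illustration of why understanding the attribute matters, the sketch below shows the kind of check a speaker can only make once it knows what OTC is. RFC 9234 defines OTC as attribute type code 35 carrying a 4-octet AS number; the function name `check_otc` and its return strings are hypothetical, and the sketch only classifies the attribute. It does not model what Bird actually does with a malformed one.

```python
OTC_TYPE_CODE = 35   # Only-to-Customer attribute type code (RFC 9234)
OTC_LENGTH = 4       # the value is a single 4-octet AS number


def check_otc(type_code: int, value: bytes) -> str:
    """Classify an attribute as not OTC, well-formed OTC, or malformed OTC."""
    if type_code != OTC_TYPE_CODE:
        return "not-otc"
    if len(value) != OTC_LENGTH:
        return "malformed"   # wrong length: must not be passed on as-is
    return "ok"


def otc_asn(value: bytes) -> int:
    """Decode the AS number from a well-formed OTC value (network byte order)."""
    if len(value) != OTC_LENGTH:
        raise ValueError("OTC value must be exactly 4 octets")
    return int.from_bytes(value, "big")
```

An implementation without RFC 9234 support never reaches a check like this, because to it type 35 is just another unknown optional transitive attribute.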
It’s difficult to confirm whether the problem originated from Bird taking a 4-byte field and inflating it into a malformed 1024-byte one. On-the-wire data suggests it remained a 4-byte update, but for stability’s sake, IAA will review alternative BGP software stacks for route servers so that we no longer depend on a single BGP implementation.