SUMMARY OF IMPACT: Between 21:03 and 22:28 UTC on 24 Jan 2020, a programming error caused network congestion that dropped around 10% of traffic and caused high latency for services traversing the network between data centers. A software update on SWAN (Software enabled Wide Area Network), Microsoft's backbone network that connects large data center regions together, caused the router forwarding tables to become mis-programmed.
ROOT CAUSE: From 21:03 UTC, a bad configuration push left the SWAN network unable to generate and program a working Forwarding Information Base (FIB) into the SWAN routers. The incompatible FIB pushes failed on some of the routers, which triggered FIB rollbacks. Each incompatible FIB push also brought down all traffic engineering tunnels on the routers, so traffic fell back to the shortest path. The resulting congestion dropped around 10% of traffic in the network.
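The effect of losing traffic engineering can be illustrated with a toy model. The sketch below is not SWAN code: the two-path topology, the capacities, the demand, and the function name drop_fraction are all hypothetical and chosen only to show why moving every flow onto the shortest path can overload it while the longer path sits idle.

```python
from typing import Dict


def drop_fraction(placement: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Fraction of offered traffic dropped when each path carries placement[path]."""
    offered = sum(placement.values())
    if offered == 0:
        return 0.0
    # Traffic above a path's capacity is treated as dropped.
    dropped = sum(max(0.0, load - capacity[path]) for path, load in placement.items())
    return dropped / offered


# Hypothetical numbers in Gbps; they are not taken from the incident.
CAPACITY = {"short": 100.0, "long": 100.0}
DEMAND = 110.0

# With traffic engineering tunnels up, demand is spread across both paths.
with_te = {"short": 60.0, "long": 50.0}
# With tunnels down, all traffic follows the shortest path.
shortest_only = {"short": DEMAND, "long": 0.0}

te_drops = drop_fraction(with_te, CAPACITY)
fallback_drops = drop_fraction(shortest_only, CAPACITY)
print(f"tunnels up: {te_drops:.1%} dropped, shortest path only: {fallback_drops:.1%} dropped")
```

In this toy case the total capacity is sufficient for the demand, so the drops come purely from where the traffic is placed, not from a shortage of capacity.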
MITIGATION: When the incompatible FIB was rolled back to the last known good state, traffic engineering worked again and drops subsided. However, the non-working FIB continued to be generated, and each install/rollback cycle caused further drops until the configuration change was rolled back. Rollback of the configuration change was slow due to safeguards built into the system. The configuration change was finally rolled back at 22:28 UTC, completely resolving the traffic drops.
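The repeated install/rollback pattern described above can be sketched as a small simulation. This is an illustrative model only; the report does not describe the actual SWAN controller logic. The Router class, generate_fib, is_compatible, and the cycle at which the configuration is reverted are all assumptions made for the example.

```python
from typing import Callable, List


class Router:
    """Hypothetical router that keeps a last known good FIB to fall back to."""

    def __init__(self, initial_fib: str) -> None:
        self.active_fib = initial_fib
        self.last_known_good = initial_fib
        self.events: List[str] = []

    def install(self, candidate: str, is_compatible: Callable[[str], bool]) -> bool:
        """Try to install a FIB; roll back to the last known good one on failure."""
        self.events.append(f"install {candidate}")
        if is_compatible(candidate):
            self.active_fib = candidate
            self.last_known_good = candidate
            return True
        # In the incident, each such install/rollback cycle caused drops.
        self.events.append(f"rollback to {self.last_known_good}")
        self.active_fib = self.last_known_good
        return False


def generate_fib(config: str, cycle: int) -> str:
    """Hypothetical generator: the FIB inherits the quality of the configuration."""
    return f"{config}-fib-{cycle}"


def is_compatible(fib: str) -> bool:
    """Hypothetical compatibility check standing in for router-side validation."""
    return not fib.startswith("bad")


router = Router("good-fib-0")
config = "bad"
failed_cycles = 0
for cycle in range(1, 6):
    if cycle == 4:
        config = "good"  # the configuration change is rolled back here (assumed timing)
    if not router.install(generate_fib(config, cycle), is_compatible):
        failed_cycles += 1

print(failed_cycles, router.active_fib)
```

The point of the sketch is that rolling back the FIB alone does not end the disruption: as long as the bad configuration stays in place, every regeneration produces another failing install.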
NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
Correcting the software defect that prevented router programming
Improving testing of SWAN software deployments with the router firmware
Improving SWAN rollback procedures to make the process faster and less error-prone