On October 4, Facebook suffered a massive outage. During the downtime, all Facebook apps were unreachable from the outside world for up to seven hours. Facebook said an engineer accidentally disconnected “all worldwide network connections” to its data centres.
According to the Facebook statement, the incident caused Facebook to withdraw the BGP route announcements for its DNS prefixes, which made Facebook’s DNS servers unreachable.
Moreover, Facebook’s outage came dangerously close to taking a large part of the Internet down with it. Because the cached records for Facebook’s domains expired and could not be refreshed at any tier of DNS server, users and apps feverishly retried their logins. This put a huge amount of strain on public DNS resolvers such as Cloudflare’s 1.1.1.1.
According to reports, the query volume for these domains at 1.1.1.1 rose to about 30 times its normal level. Thankfully, 1.1.1.1 held up: the response time for the majority of DNS queries stayed stable at about 10 milliseconds during the Facebook outage. Had a major public resolver failed as well, the consequences would have been catastrophic.
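To make the amplification concrete, here is a deliberately simplified sketch of why a resolver sees far more traffic when a popular domain stops resolving. The function name, the cache hit percentage, and the retry count are all assumptions chosen for illustration, not measurements from the incident. The model also ignores negative caching, which real resolvers and clients use to dampen this effect.

```python
def resolver_queries(lookups: int, cache_hit_percent: int,
                     retries_per_failure: int, outage: bool) -> int:
    """Estimate how many queries reach a resolver for one domain.

    Simplified, hypothetical model:
    - normally, client-side caches absorb `cache_hit_percent` of lookups;
    - during an outage there is no answer to cache, so every lookup
      reaches the resolver and is retried `retries_per_failure` times.
    """
    if not outage:
        return lookups * (100 - cache_hit_percent) // 100
    return lookups * (1 + retries_per_failure)


# Illustrative numbers only: 1000 lookups, 90% cache hits, 2 retries.
normal = resolver_queries(1000, 90, 2, outage=False)
failing = resolver_queries(1000, 90, 2, outage=True)
print(normal, failing, failing // normal)
```

With these made-up inputs, the loss of caching alone multiplies the load tenfold and the retries triple it again, which shows how a multiplier of this order can arise without any growth in the number of users.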
What can we take away from this incident?
I believe it would have taken Facebook’s very capable engineers less than a minute to identify the cause. Yet the outage persisted for about seven hours after the erroneous BGP command was issued. Since all of Facebook’s internal networks were disrupted too, we can draw an essential conclusion: the same erroneous command also took down every VPN channel.
We know that the pandemic has affected Facebook too, and that many Americans are still working remotely. In other words, once the VPN and the escape channels used by remote operations engineers fail, the data centres become unreachable to them. The personnel on duty on-site can only do basic operations such as powering machines on and restarting them, and it is quite possible that they lack the permissions to access the core network equipment. In that situation, the operations experts who normally work remotely have to travel to the site before they can fix anything.
Facebook’s network engineers may have overestimated the viability of their rollback plan, only to learn afterwards that the network equipment could no longer be logged in to remotely, which left the rollback plan in pieces. Planning for the worst is critical before releasing any change, and it is a mistake to assume that a small change cannot go wrong.
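One way to keep a rollback from depending on remote login is to let the change revert itself when a health check fails, in the spirit of the "commit confirmed" pattern found on some network operating systems. The sketch below is a minimal illustration of that idea and not a description of Facebook's tooling. The function and the toy `state` dictionary are hypothetical.

```python
from typing import Callable


def apply_with_auto_rollback(apply: Callable[[], None],
                             rollback: Callable[[], None],
                             still_reachable: Callable[[], bool]) -> bool:
    """Apply a change and keep it only if the health check passes.

    The check and the rollback run locally, next to the change itself,
    so reverting does not require anyone to log in remotely.
    """
    apply()
    try:
        healthy = still_reachable()
    except Exception:
        # A check that cannot run is treated as a failed check.
        healthy = False
    if not healthy:
        rollback()
    return healthy


# Toy example (hypothetical state): the change withdraws our own routes.
state = {"routes_announced": True}

def withdraw() -> None:
    state["routes_announced"] = False

def restore() -> None:
    state["routes_announced"] = True

kept = apply_with_auto_rollback(withdraw, restore,
                                lambda: state["routes_announced"])
print(kept, state)
```

Here the change fails its own health check, so it is reverted automatically and the routes stay announced.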
The remote-login escape channel must have been knocked out as well at the time of the failure. The lesson is to verify during normal operation that the escape path really works, and to protect its independence. It is vital not to let the escape path share anything with the normal path.
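Both checks, that the escape path works in peacetime and that it is independent, can be automated. The following is a rough sketch under stated assumptions: the `AccessPath` type, the dependency labels, and the `probe` callback are all hypothetical. A real audit would need an accurate inventory of what each path relies on.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class AccessPath:
    name: str
    dependencies: FrozenSet[str]  # systems this path needs in order to work


def audit_escape_path(primary: AccessPath, escape: AccessPath,
                      probe: Callable[[AccessPath], bool]) -> List[str]:
    """Return a list of problems with the escape path (empty means OK)."""
    problems = []
    # Independence: anything shared can take down both paths at once.
    shared = primary.dependencies & escape.dependencies
    if shared:
        problems.append("shared dependencies: " + ", ".join(sorted(shared)))
    # Availability: test the escape path now, not during an outage.
    if not probe(escape):
        problems.append(f"{escape.name} is not reachable right now")
    return problems


# Hypothetical inventory: both paths quietly rely on internal DNS.
vpn = AccessPath("corporate VPN", frozenset({"backbone", "internal DNS"}))
oob = AccessPath("out-of-band console", frozenset({"internal DNS", "LTE modem"}))

problems = audit_escape_path(vpn, oob, probe=lambda path: True)
print(problems)
```

In this made-up inventory the escape path responds to the probe, but the audit still flags it because it shares internal DNS with the normal path.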