Prof. Avishai Wool
Published: 6 October 2021
Avishai Wool, CTO at AlgoSec, analyses the recent Facebook outage and the risks all organizations face in network configuration
Social media giant Facebook suffered a network outage on 4 October 2021 that lasted nearly six hours and took its sister platforms Instagram and WhatsApp offline.
As the story developed, it became apparent that the incident was caused by a configuration issue affecting Facebook’s BGP (Border Gateway Protocol) routing. BGP is one of the systems the internet uses to get traffic where it needs to go as quickly as possible. The outage also cut off the company’s internal communications, along with authentication to third-party services including Google and Zoom. Some reports suggested that security passes went offline as well, which stopped engineers from entering the building to physically reset the data center.
The impact was felt worldwide, with Downdetector recording more than 10 million problem reports, the largest number for a single incident. Facebook released an official statement following the outage: “Our engineering teams learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.”
While Facebook has assured its users that no data has been lost in this process, the outage is a stark reminder of how small configuration errors can have huge, far-reaching consequences.
The fundamentals of application availability
At the most fundamental level, Facebook suffered a loss of application availability. When a change was actioned, it set off a chain reaction that ultimately wiped Facebook and its related services from the internet, because the company could not see the entire lifecycle of that change and the impact it would have.
To avoid an incident like this in the future, organizations should consider a few simple steps:
Back up configuration files to allow for rollbacks should an issue arise
Use a test system alongside live processes to run scenarios without causing any disruptions
Retain low-tech alternatives to guarantee access to the network if the primary route fails
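The first two steps can be illustrated with a minimal Python sketch. It is not how Facebook or any particular vendor manages changes: the function name `apply_with_rollback` and the `validate` and `health_check` callbacks are hypothetical, and the configuration is assumed to live in a single local file. The candidate configuration is checked before it goes live, a backup is taken, and the backup is restored if a check after the change fails. The third step, keeping a low-tech alternative route into the network, is an operational measure and is not covered here.

```python
import shutil
from pathlib import Path
from typing import Callable


def apply_with_rollback(
    config_path: Path,
    new_config: str,
    validate: Callable[[str], bool],
    health_check: Callable[[], bool],
) -> bool:
    """Apply new_config to config_path and restore the backup if a check fails.

    Returns True if the change was kept, False if it was rejected or rolled back.
    All names here are illustrative assumptions, not a real product API.
    """
    # Test the candidate first (stand-in for a test system) so a bad
    # change never reaches the live file.
    if not validate(new_config):
        return False

    # Back up the current configuration so a rollback is possible.
    backup_path = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup_path)

    # Apply the change.
    config_path.write_text(new_config)

    # Proactive check after the change; restore the backup if it fails.
    if not health_check():
        shutil.copy2(backup_path, config_path)
        return False
    return True
```

In practice `validate` might run the candidate through a lab or simulated network, and `health_check` might confirm that key services are still reachable. Both are left as callbacks here because the right checks depend on the environment.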
The outages across Facebook’s infrastructure highlight the operational risk that all organizations face from faulty configuration changes, which can drastically affect application availability. Intelligent automation, thorough change management and proactive checks are key to avoiding these outages.