AlgoBuzz Blog

Everything you ever wanted to know about security policy management, and much more.

Search
Generic filters
Exact matches only
Search in title
Search in content
Search in excerpt
Filter by Custom Post Type
Posts

The Facebook outage and network configuration

by

Avishai Wool, CTO at AlgoSec, analyses the recent Facebook outage and the risks all organizations face in network configuration 

Social media giant Facebook was involved in a network outage on the 4th October 2021 that lasted for nearly six hours and took its sister platforms Instagram and WhatsApp offline.

As the story developed, it became apparent that the incident was caused by a configuration issue within Facebook’s BGP (Border Gateway Protocol), one of the systems that the internet uses to get your traffic where it needs to go as quickly as possible. The outage also cut off the company’s internal communications, along with authentication to third-party services including Google and Zoom. Some reports suggested security passes went offline, which stopped engineers from entering the building to physically reset the data center.

The impact was felt worldwide, with Downdetector recording more than 10 million problem reports, the largest number for one single incident. Facebook released an official statement following the outage stating: “Our engineering teams learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.”

While Facebook has assured its users that no data has been lost in this process, the outage is a stark reminder of how small configuration errors can have huge, far-reaching consequences.

The fundamentals of application availability

At the fundamental level, Facebook suffered from a lack of application availability. When a change was actioned, it caused a major chain reaction that ultimately wiped Facebook and its related services from the internet because they couldn’t see the entire lifecycle of that change and the impact it would have.

To avoid an incident like this in the future, organizations should consider a few simple steps:

  • Back up configuration files to allow for rollbacks should an issue arise
  • Use a test system alongside live processes to run scenarios without causing any disruptions
  • Retain low-tech alternatives to guarantee access to the network if the primary route fails

The outages across Facebook’s infrastructure highlight the operational risks all organizations face around faulty configuration changes which can drastically impact application availability. Intelligent automation, thorough change management and proactive checks are key to avoid these outages.

Subscribe to Blog

Receive notifications of new posts by email.