[afnog] Another Perspective - Kentik's View on the Facebook Outage

Fri Oct 8 04:49:30 UTC 2021

On 10/7/21 19:49, sm+afrinic at elandsys.com wrote:

> Coming up with a number would require an academic exercise.  I haven't
> looked into it as I was more interested in the topic below.
>
> For people interested in configuration management, the following 
> illustrates how to configure over 17,000 systems:
> https://www.socallinuxexpo.org/scale11x-supporting/default/files/presentations/cfgmgmt.pdf

A network of this scope can only be managed with "automation tools".

Unfortunately, automation tools are not perfect, and managing them is
just as hard as managing boxes manually through the CLI. So at some
point or another they will sneeze, and what happened this week at
Facebook is a risk that Facebook and other operators of similar scope
have to accept.

Their task is to reduce how often events such as these occur, not to
expect that the number will ever be zero.

I mean, just yesterday, Telia pushed out a prefix filter update to their 
routers, and caused an outage across a large section of their backbone. 
And they aren't as large as Facebook.

Automation is great, but it is not risk-free, at any scale.

Mark.