[afnog] Another Perspective - Kentik's View on the Facebook Outage

Fri Oct 8 18:21:09 UTC 2021

Hi Mark,
At 09:49 PM 07-10-2021, Mark Tinka wrote:
>A network of this scope can only be managed with "automation tools".
>
>Unfortunately, automation tools are not perfect, and managing them 
>is just as hard as managing CLI on boxes manually. So at some point 
>or other, they will sneeze, and what happened this week at Facebook 
>is a risk that they and other operators of similar scope have to accept.
>
>Their task is to reduce the number of times events such as these 
>occur, not to expect that they will be zero.
>
>I mean, just yesterday, Telia pushed out a prefix filter update to 
>their routers, and caused an outage across a large section of their 
>backbone. And they aren't as large as Facebook.
>
>Automation is great, but it is not risk-free, at any scale.

Getting it all to work takes a bit more than the software.  There are 
review processes to catch configuration errors.  Canary testing [1] 
is used in some environments to mitigate the risk of a large-scale outage.

Nowadays, some of the software is available for free.  That can be 
useful to anyone interested in avoiding NIH (not-invented-here) syndrome.

I saw a few messages about the Telia incident.  I haven't had time to 
look into it.

Regards,
S. Moonesamy

1. https://status.cloud.google.com/incident/compute/17003#5660850647990272