[afnog] Another Perspective - Kentik's View on the Facebook Outage
sm+afrinic at elandsys.com
Fri Oct 8 18:21:09 UTC 2021
Hi Mark,
At 09:49 PM 07-10-2021, Mark Tinka wrote:
>A network of this scope can only be managed with "automation tools".
>
>Unfortunately, automation tools are not perfect, and managing them
>is just as hard as managing CLI on boxes manually. So at some point
>or other, they will sneeze, and what happened this week at Facebook
>is a risk that they and other operators of similar scope have to accept.
>
>Their task is to reduce the number of times events such as these
>occur, not to expect that they will be zero.
>
>I mean, just yesterday, Telia pushed out a prefix filter update to
>their routers, and caused an outage across a large section of their
>backbone. And they aren't as large as Facebook.
>
>Automation is great, but it is not risk-free, at any scale.
Getting it all to work takes a bit more than the software. There are
review processes to catch configuration errors. Canary testing [1]
is used in some environments to mitigate the risk of a large-scale outage.
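
As a rough illustration of the canary idea (not any operator's actual
tooling), a staged rollout could be sketched as follows. The function
name, the device list and the apply/health-check/rollback callbacks are
all hypothetical and would be supplied by whatever automation is in use.

```python
from typing import Callable, List


def canary_rollout(
    devices: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_count: int = 1,
) -> bool:
    """Apply a change to a few canary devices first; continue only if they stay healthy.

    Returns True if the change reached every device, False if it was rolled back.
    All callbacks are hypothetical hooks into the operator's own tooling.
    """
    canaries, rest = devices[:canary_count], devices[canary_count:]

    # Stage 1: push the change to the small canary group only.
    for dev in canaries:
        apply_change(dev)

    # Stage 2: check the canaries; on any failure, undo the change and stop.
    if not all(is_healthy(dev) for dev in canaries):
        for dev in canaries:
            rollback(dev)
        return False

    # Stage 3: the canaries look fine, so roll out to the remaining devices.
    for dev in rest:
        apply_change(dev)
    return True
```

The point of the sketch is only that a bad change stops at the canary
group instead of reaching the whole network. It does not remove the
risk; it limits how far a mistake can spread.
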
Nowadays, some of the software is available for free. That can be
useful to anyone interested in avoiding NIH ("not invented here") syndrome.
I saw a few messages about the Telia incident. I haven't had time to
look into it.
Regards,
S. Moonesamy
1. https://status.cloud.google.com/incident/compute/17003#5660850647990272