[afnog] Another Perspective - Kentik's View on the Facebook Outage
Mark Tinka
mark at tinka.africa
Fri Oct 8 04:49:30 UTC 2021
On 10/7/21 19:49, sm+afrinic at elandsys.com wrote:
> Coming up with a number would require an academic exercise. I haven't
> looked into it, as I was more interested in the topic below.
>
> For people interested in configuration management, the following
> illustrates how to configure over 17,000 systems:
> https://www.socallinuxexpo.org/scale11x-supporting/default/files/presentations/cfgmgmt.pdf

A network of this scope can only be managed with "automation tools".
Unfortunately, automation tools are not perfect, and managing them is
just as hard as managing CLI on boxes manually. So at some point or
other, they will sneeze, and what happened this week at Facebook is a
risk that they and other operators of similar scope have to accept.
Their task is to reduce how often events like this occur, not to
expect that the number will ever be zero.

I mean, just yesterday, Telia pushed out a prefix filter update to their
routers, and caused an outage across a large section of their backbone.
And they aren't as large as Facebook.

Automation is great, but it is not risk-free, at any scale.

Mark.