[afnog] Another Perspective - Kentik's View on the Facebook Outage

Fri Oct 8 17:42:55 UTC 2021

Hi,

Correct, no actual NS records were changed. This was a wrong assumption on
my part.

The withdrawal of unreachable prefixes is common practice for any anycast
service.

As a solution, you could use a third party for some of your aDNS
infrastructure IPs. You would then also need your DCs to be reachable via
third-party IP space, or host your services in third-party infrastructure.
I assume FB looked at this option and saw more benefit in being in total
control of their infrastructure: trust vs. benefit.

They could consider at least having some OOB reachability via an
independent path, so that they can at least get into their boxes.

What I still can't wrap my head around is that all datacenters globally
stopped being reachable at once. I assume each DC has its own dedicated IP
space. So if a script removed the advertisement of DC1's IP prefix, the
aDNS that uses DC1 would withdraw its anycast prefix announcement, but
shouldn't DC2 still be reachable? Perhaps therein lies the solution: limit
your automation tools to one DC ....
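
A rough sketch of that scenario, as a toy model only. The site and DC names
(pop-a, pop-b, dc1, dc2) and the health rule (a site withdraws as soon as any
DC it depends on is unreachable) are assumptions for illustration, not FB's
actual implementation:

```python
# Toy model of "aDNS site withdraws its anycast prefix when it cannot
# reach its data centers". All names and the health rule are hypothetical.

def announcing_sites(site_deps, reachable_dcs):
    """Return the aDNS sites that keep announcing the anycast prefix.

    A site keeps announcing only while every DC it depends on is
    reachable; otherwise it declares itself unhealthy and withdraws.
    """
    return sorted(
        site
        for site, deps in site_deps.items()
        if deps and deps <= reachable_dcs
    )


# Each aDNS site depends on one DC (assumed mapping).
site_deps = {
    "pop-a": {"dc1"},
    "pop-b": {"dc2"},
}

# Automation limited to one DC: only dc1 becomes unreachable.
print(announcing_sites(site_deps, {"dc2"}))

# Backbone removed everywhere: no DC is reachable from any site.
print(announcing_sites(site_deps, set()))
```

In the first case pop-b keeps serving and anycast shifts traffic there; in
the second case every site withdraws at once, which matches what was
observed.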

BR,
Markus

On Fri, 8 Oct 2021, 1:50 pm Daniel Shaw, <dshaw78 at gmail.com> wrote:

> Hello,
>
> On Thu, 7 Oct 2021 at 10:35, Markus Akena Wipfler
> <markus.wipfler at gmail.com> wrote:
> >
> > All good and well but no mention of the disappearance of the actual DNS
> > records on their authoritative DNSes :) during that period. I don't think
> > it's by design, e.g. remove DNS record if prefix X is not seen in routing
> > table.
> >
>
> I very much doubt any DNS *records* were changed anywhere. I am not
> sure why you think that.
>
> FB themselves say (ref
> https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
>  that:
>
> " To ensure reliable operation, our DNS servers disable those BGP
> advertisements if they themselves can not speak to our data centers,
> since this is an indication of an unhealthy network connection. In the
> recent outage the entire backbone was removed from operation,  making
> these locations declare themselves unhealthy and withdraw those BGP
> advertisements. The end result was that our DNS servers became
> unreachable even though they were still operational. "
>
> This matches reports from folks who run DNS resolvers, which
> were giving back SERVFAIL - which is consistent with the
> recursive resolver being unable to talk to the upstream authoritative
> server at all.
>
> In other words, FB probably has anycast auth DNS systems in many
> distributed PoPs/locations.
>
> In each location, the systems monitor their own control-plane
> connections (FB's backbone). When that is down (and thus the PoP
> cannot get updates xfr in, cannot send logs back, etc.) then the
> systems withdraw the data-plane/anycast IPs so that traffic goes
> elsewhere.
>
> This is actually a great design for any partial outage. Any remote
> location that FB loses control of would take itself offline from
> serving (soon to be stale) data. Anycast would ensure that traffic is
> sent to alternate locations.
>
> They clearly never planned for *ALL* locations to lose connection to
> home-base *at the same time*. Oops.
>
> -- Daniel
>