[afnog] Another Perspective - Kentik's View on the Facebook Outage

Fri Oct 8 11:50:42 UTC 2021

Hello,

On Thu, 7 Oct 2021 at 10:35, Markus Akena Wipfler
<markus.wipfler at gmail.com> wrote:
>
> All good and well but no mention of disappearance of the actual DNS records on their authoritative DNSes :) during that period. I don't think it's per design, eg remove DNS record if prefix X is not seen in routing table.
>

I very much doubt any DNS *records* were changed anywhere. I am not
sure why you think that.

FB themselves say (ref
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
that:

" To ensure reliable operation, our DNS servers disable those BGP
advertisements if they themselves can not speak to our data centers,
since this is an indication of an unhealthy network connection. In the
recent outage the entire backbone was removed from operation,  making
these locations declare themselves unhealthy and withdraw those BGP
advertisements. The end result was that our DNS servers became
unreachable even though they were still operational. "

This matches reports from folks who run DNS resolvers, which were
returning SERVFAIL. That is consistent with the recursive resolver
not being able to talk to the upstream authoritative servers at all.
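
To make the distinction concrete, here is a toy Python model of what a
recursive resolver hands back in each case. It is only a sketch: the
function name and the way replies are represented are made up for
illustration, and a real resolver also has retries, caching and
timeouts (and would move on past a server that itself answers
SERVFAIL), all of which are ignored here.

```python
from typing import Optional, Sequence

# Standard DNS rcode names, used here as plain strings.
NOERROR = "NOERROR"
NXDOMAIN = "NXDOMAIN"
SERVFAIL = "SERVFAIL"


def recursive_rcode(auth_replies: Sequence[Optional[str]]) -> str:
    """Return the rcode a recursive resolver would give its client.

    auth_replies holds one entry per authoritative server tried, in
    order: an rcode string if that server answered, or None if it was
    unreachable (hypothetical representation, for illustration only).
    """
    for reply in auth_replies:
        if reply is not None:
            # Any authoritative answer, even a negative one, is passed on.
            return reply
    # Nobody answered at all: the resolver can only report failure.
    return SERVFAIL


# Record deleted, servers reachable: the client sees an authoritative "no".
print(recursive_rcode([NXDOMAIN]))          # NXDOMAIN
# Servers unreachable (routes withdrawn): the client sees SERVFAIL.
print(recursive_rcode([None, None, None]))  # SERVFAIL
```

So a deleted record and an unreachable server look different from the
outside, and what people reported was the second case.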

In other words, FB probably has anycast auth DNS systems in many
distributed PoPs/locations.

In each location, the systems monitor their own control-plane
connections (FB's backbone). When those are down (and thus the PoP
cannot get updates transferred in, cannot send logs back, etc.) the
systems withdraw the data-plane/anycast IPs so that traffic goes
elsewhere.
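
As a rough illustration of that per-location logic, here is a minimal
Python sketch. This is my guess at the general shape, not FB's code:
the class name, the probe input and the three-failure threshold are
all assumptions, and the actual BGP announce/withdraw step is reduced
to a boolean.

```python
from dataclasses import dataclass


@dataclass
class AnycastSite:
    """One hypothetical PoP serving DNS on a shared anycast prefix."""
    name: str
    failures_to_withdraw: int = 3  # assumed threshold, not from FB's post
    announcing: bool = True
    _failures: int = 0

    def observe(self, backbone_ok: bool) -> bool:
        """Feed in one backbone health probe; return whether we announce."""
        if backbone_ok:
            # Backbone reachable again: reset and (re)announce.
            self._failures = 0
            self.announcing = True
        else:
            self._failures += 1
            if self._failures >= self.failures_to_withdraw:
                # Cut off from home-base: stop attracting traffic here.
                self.announcing = False
        return self.announcing


site = AnycastSite("pop-1")
for ok in [True, False, False, False]:
    state = site.observe(ok)
print(state)  # False
```

Note that the DNS server itself never stops running in this model. It
just stops advertising the route that leads to it.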

This is actually a great design for any partial outage. Any remote
location that FB loses control of would take itself offline rather
than serve (soon to be stale) data. Anycast would ensure that traffic
is sent to alternate locations.

They clearly never planned for *ALL* locations to lose connection to
home-base *at the same time*. Oops.
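
The failure mode is easy to see in a toy model where every location
applies the same local rule independently. Again just a sketch, with
invented site names and a deliberately simplified rule (announce only
while your own backbone link is up):

```python
from typing import Dict, List


def serving_sites(backbone_up: Dict[str, bool]) -> List[str]:
    """Sites still announcing the anycast prefix.

    Each site decides purely on its own backbone state. No site looks
    at what the others are doing.
    """
    return sorted(site for site, up in backbone_up.items() if up)


# Hypothetical PoPs: one loses the backbone, versus all of them.
partial = {"pop-a": True, "pop-b": False, "pop-c": True}
total = {"pop-a": False, "pop-b": False, "pop-c": False}

print(serving_sites(partial))  # ['pop-a', 'pop-c']
print(serving_sites(total))    # []
```

In the partial case anycast simply shifts traffic to the survivors. In
the total case there is nobody left announcing the prefix.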

-- Daniel