[afnog] Internally inconsistent RRDP publication (Was: [Sidrops] another TA oopsie)

Job Snijders job at fastly.com
Mon Jan 2 16:11:43 UTC 2023


Dear all,

I took a look at what might have transpired. It appears there was an
internally-inconsistent RRDP publication. Similar to the RSYNC protocol,
the RRDP protocol does not offer any assurances about internal
consistency. In this message I offer a step-by-step explanation and at
the end of the email I theorize on how this could've happened.

Impact:
=======

The problem revolves around a 'top level' manifest [1] which contained
references to files which were not yet available via RRDP. Because the
K1eJenypZMPIt_e92qek2jSpj4A.mft manifest referenced non-existent files,
about 77.33% of the ROAs subordinate to the Afrinic trust anchor were
negatively impacted. Depending on the RRDP refetch timers of a
validator, the impact may have lasted anywhere between 1 and 60 minutes.

This impacted all RFC-compliant validators: the event was 'timing
dependent' rather than 'implementation dependent'. Connecting at the
wrong time caused problems.

Step by step replay:
====================

A validator fetching Afrinic's RRDP Notification file at
2023-01-01T03:21:51Z might have received a notification XML file which
listed deltas up to and including serial 58617 (in the RRDP session ID
11218e02-4ae9-4c95-a8fa-49df27f15272).

https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58616/snapshot.xml
https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58617/delta.xml
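
For those who want to follow along, here is a minimal Python sketch of
how a client might read a Notification file and decide which deltas it
still needs. The element and attribute names follow RFC 8182. The sample
XML is hand-written for illustration (its hash attributes are
placeholders, not real digests), and deltas_to_apply is a hypothetical
helper name, not part of any validator.

```python
import xml.etree.ElementTree as ET

RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"

# Hand-written sample; the hash attributes are placeholders, not real digests.
SAMPLE_NOTIFICATION = """\
<notification xmlns="http://www.ripe.net/rpki/rrdp" version="1"
    session_id="11218e02-4ae9-4c95-a8fa-49df27f15272" serial="58617">
  <snapshot uri="https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58617/snapshot.xml"
            hash="PLACEHOLDER"/>
  <delta serial="58617"
         uri="https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58617/delta.xml"
         hash="PLACEHOLDER"/>
</notification>
"""


def deltas_to_apply(notification_xml, local_serial):
    """Return (session_id, serial, [(delta_serial, uri), ...]) for the
    deltas that are newer than the serial the client already holds."""
    root = ET.fromstring(notification_xml)
    session_id = root.get("session_id")
    serial = int(root.get("serial"))
    deltas = sorted(
        (int(d.get("serial")), d.get("uri"))
        for d in root.findall(RRDP_NS + "delta")
    )
    pending = [(s, uri) for s, uri in deltas if s > local_serial]
    return session_id, serial, pending
```

A client sitting at serial 58616 would get exactly one pending delta
(58617) from the sample above; a client already at 58617 would get none.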

The SHA256 hash of "K1eJenypZMPIt_e92qek2jSpj4A.mft" at serial 58616 was
435c65e0f7bc43eaea3234b3ad08b849735c1899c8e218ff2395d37cad720493, and
the manifestNumber was 13F1.

At rrdp_serial 58616 / K1eJenypZMPIt_e92qek2jSpj4A.mft / manifestNumber
13F1, the listed SHA256 hash of "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" was
1768a7544c15081ddcd358a78b915a7221f3aee6cebb196a743b89a834364ca4. And
indeed, if one downloads the above mentioned "snapshot.xml" file and
unpacks the RRDP XML one will find a file by that name which matches
that digest. The state at RRDP serial 58616 was internally consistent.

Now, let's unpack the RRDP Delta which would bring the RRDP session to
serial 58617. The delta file contains 4 <publish/> elements:

    58617 rpki.afrinic.net/repository/afrinic/K1eJenypZMPIt_e92qek2jSpj4A.crl (a4f73c2009f4095970f0f7cb4bb938eb03ff71e35925cd8bca39a64330f935c1 replaces 502d94adf603c4451a912828dfe9d7a46ebf45ec20f901381618fc71323da927)

    58617 rpki.afrinic.net/repository/afrinic/K1eJenypZMPIt_e92qek2jSpj4A.mft (e745ccf5741fbe65c2e2b78a74ba3be4a82c9fd5330544e16332e725861f66e5 replaces 435c65e0f7bc43eaea3234b3ad08b849735c1899c8e218ff2395d37cad720493)

    58617 rpki.afrinic.net/repository/member_repository/F36D8ADD/99DB6EFC6AC711EBB90AF548F8AEA228/JrOnWLLY0r61xvaBylvZJYx593c.crl (331a8991ca11ccd9bbf30e89e8e35d3b6ee0a18c23cca1289dfcf07bdee3d05f replaces 5a7399b06a692dd76e3b94fa52112c12f483db1499e5c899ff27b57952e48635)

    58617 rpki.afrinic.net/repository/member_repository/F36D8ADD/99DB6EFC6AC711EBB90AF548F8AEA228/JrOnWLLY0r61xvaBylvZJYx593c.mft (cf22f16de6695f8509a6590f710778cc61a1bbdf1c11ae150dcfff1910032cae replaces 29219ecb0f79922d6f1e5d4b3d4305333d32f33720cf13ae17d84dd2fcdf2ff0)
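
A listing like the one above can be derived from a delta file roughly as
follows. This is a sketch, assuming the delta follows the RFC 8182
layout (base64 object content inside <publish/>, with an optional "hash"
attribute naming the object being replaced); summarize_delta is a
hypothetical helper name.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"


def summarize_delta(delta_xml):
    """Return (serial, rows), where each row is
    (uri, sha256 of the new object, sha256 of the replaced object or None)."""
    root = ET.fromstring(delta_xml)
    serial = int(root.get("serial"))
    rows = []
    for pub in root.findall(RRDP_NS + "publish"):
        # The element text is the base64-encoded object; whitespace is ignored.
        content = base64.b64decode(pub.text or "")
        new_hash = hashlib.sha256(content).hexdigest()
        # The "hash" attribute is only present when an object is replaced.
        rows.append((pub.get("uri"), new_hash, pub.get("hash")))
    return serial, rows


def format_rows(serial, rows):
    """Render rows as 'serial uri (new replaces old)' lines."""
    lines = []
    for uri, new_hash, old_hash in rows:
        if old_hash:
            lines.append(f"{serial} {uri} ({new_hash} replaces {old_hash})")
        else:
            lines.append(f"{serial} {uri} ({new_hash} is new)")
    return lines
```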

Let's focus on K1eJenypZMPIt_e92qek2jSpj4A.mft. The eContent of the
manifest files whose SHA256 digests are
435c65e0f7bc43eaea3234b3ad08b849735c1899c8e218ff2395d37cad720493 and
e745ccf5741fbe65c2e2b78a74ba3be4a82c9fd5330544e16332e725861f66e5 decodes
as follows:

K1eJenypZMPIt_e92qek2jSpj4A.mft @ 13F1: https://sobornost.net/~job/manifest-13F1.txt
K1eJenypZMPIt_e92qek2jSpj4A.mft @ 13F2: https://sobornost.net/~job/manifest-13F2.txt

Thus, we conclude:
    RRDP serial 58616 contained a manifest with number 13F1
    RRDP serial 58617 contained a manifest with number 13F2

Both 13F1 and 13F2 are signed by the proper keys, but manifestNumber
13F2 is higher than 13F1; thus 13F2 is the manifest that must be used.
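
One small pitfall when comparing manifest numbers: they are commonly
displayed in hexadecimal (as above), so they should be compared as
integers rather than as strings. A minimal sketch, with newer_manifest
being a hypothetical helper name:

```python
def newer_manifest(a_hex, b_hex):
    """Return whichever hex-encoded manifestNumber is numerically larger."""
    return a_hex if int(a_hex, 16) > int(b_hex, 16) else b_hex


# A plain string comparison would get "FF" versus "100" wrong.
assert newer_manifest("13F1", "13F2") == "13F2"
assert newer_manifest("FF", "100") == "100"
```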

Manifest 13F2 references a new version of "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer"
by hash 8aa55347427b75faa64fdfd212ca013957f785e18ce887bbe56d0ae20552e66c,
however, at RRDP serial 58617 the delta XML does *NOT* contain any new
version of "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer"!

In fact, an update for "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" only became
visible at a later point in time: at RRDP serial 58618. Looking at
https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58618/delta.xml
we finally see a version of "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" which
matches the hash on the manifest that was published at serial 58617.

In other words, AFRINIC published an RRDP delta (and snapshot) which
were cryptographically valid, but internally inconsistent.

Researchers can see this themselves if they analyse:
https://rrdp.afrinic.net/11218e02-4ae9-4c95-a8fa-49df27f15272/58617/snapshot.xml
The version of "K1eJenypZMPIt_e92qek2jSpj4A.mft" inside the 58617
snapshot points to "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" expecting a file
with sha256 message digest 
8aa55347427b75faa64fdfd212ca013957f785e18ce887bbe56d0ae20552e66c
but the hash of "vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" actually is
1768a7544c15081ddcd358a78b915a7221f3aee6cebb196a743b89a834364ca4
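
The comparison itself is simple once the manifest has been decoded. The
sketch below assumes the manifest's fileList is already available as a
filename-to-digest mapping (decoding the CMS signed object is out of
scope here) and that the published objects have been extracted from the
snapshot. check_manifest is a hypothetical helper name.

```python
import hashlib


def check_manifest(file_list, published):
    """Compare a decoded manifest fileList against the published objects.

    file_list: {filename: expected SHA256 hex digest}
    published: {filename: object bytes as found in the RRDP snapshot}
    Returns a list of (filename, problem); an empty list means consistent.
    """
    problems = []
    for name, expected in sorted(file_list.items()):
        data = published.get(name)
        if data is None:
            problems.append((name, "missing"))
        elif hashlib.sha256(data).hexdigest() != expected.lower():
            problems.append((name, "hash mismatch"))
    return problems
```

Fed with the fileList of manifest 13F2 and the objects in the 58617
snapshot, a check along these lines would be expected to flag
"vY7ReUeW-s0Fq4qzboGCgYmDQXg.cer" as a hash mismatch.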

Per RFC 9286, the above scenario is considered a "publisher error" or a
"substitution attack" (RPs can't know the difference between publisher
errors and attacks); the RP is expected to proceed with the process
described in Section 6.6 of RFC 9286.

serial 58616 was good
serial 58617 was bad
serial 58618 was good

While the issue was 'rectified' in the next publication, any clients
that latched on to 58617 might take between 1 and 60 minutes to return
for new data; completely unaware that the contents of the 58617 update
were cryptographically valid, but logically mostly broken.

How can this happen?
====================

This type of internal inconsistency could arise from deployment
scenarios in which the RRDP XML files are synthesized from a bare
directory on the filesystem - without additional context about internal
consistency (e.g. when exactly the Signer software has written a
coherent state to the filesystem, and it is safe to transform the files
into RRDP).

Software like https://github.com/NLnetLabs/rrdpit is inherently unaware
of whether the Signer software has finished writing to the filesystem
(or is still 'half way' through the writing process). This means that a
tool like "rrdpit" MUST only be invoked when the Signer software is
completely finished.

Generating RRDP XML files while the Signer software is still 'half way'
through writing can accidentally smear what should've been the contents
of a single RRDP XML Delta file across multiple RRDP delta files.
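
If a directory-to-RRDP conversion is unavoidable, one crude guard is to
refuse to run while the directory is still changing. The sketch below is
only a heuristic of my own for illustration: it fingerprints the tree
twice and compares the results. It cannot prove that the Signer is done,
and an explicit "finished" signal from the Signer would be far more
robust. The function names and the settle_seconds default are
assumptions.

```python
import hashlib
import os
import time


def tree_fingerprint(root):
    """Hash every file's relative path and content into a single digest."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                file_hash = hashlib.sha256(fh.read()).hexdigest()
            rel = os.path.relpath(path, root)
            digest.update(f"{rel}\0{file_hash}\n".encode())
    return digest.hexdigest()


def looks_quiescent(root, settle_seconds=5.0):
    """Return True if the tree did not change during the settle period.

    Heuristic only: a Signer that pauses for longer than settle_seconds
    in the middle of a write would still slip through.
    """
    before = tree_fingerprint(root)
    time.sleep(settle_seconds)
    return tree_fingerprint(root) == before
```

A wrapper script could call looks_quiescent() on the publication
directory and only then hand the directory to the conversion tool.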

Why am I suspecting that a tool like "rrdpit" is used?
======================================================

The AfriNIC RRDP snapshots contain unexpected files, such as
"rsync://rpki.afrinic.net/repository/AfriNIC-simple.tal"; the signer
implementations I am aware of would not include .tal files in the RRDP
feed. This leads me to believe that a
non-atomic/fragile-to-inconsistency process is used to convert a
(rsync?) directory to RRDP files.

Is "rrdpit" bad?
================

No. It is a very useful utility (I myself have used it in various lab
tests), but it needs to be handled with care: the utility is not aware
of internal inconsistencies and cannot compensate for them. The "rrdpit"
utility is not appropriate for all deployment scenarios: it probably is
best to use the native RRDP functionality of a Signer!

How to avoid this?
==================

If AFRINIC is using the "rpki.net" (or a derivative) signer software,
they might benefit most from using the embedded RRDP functionality of
the "rpki.net" software stack.

If AfriNIC does not want to expose a webserver on the signer machine
itself, they can simply rsync the ready-made RRDP XML files (produced by
"rpki.net") to a webserver. This approach contrasts with rsyncing the
rsync files and using "rrdpit" (or equivalent tooling).

Conclusion
==========

For a brief period of time AFRINIC published a set of RRDP files that
led to an inconsistent state, resulting in the temporary loss of about
77% of ROAs.

As I don't know the internals of AFRINIC's setup, the above could all be
a fitting - but wrong - theory. I am speculating with the public
information available to me.

I'm available for any questions, or to advise on this matter and review
the current process workflow.

Kind regards,

Job

[1]: https://console.rpki-client.org/rpki.afrinic.net/repository/afrinic/K1eJenypZMPIt_e92qek2jSpj4A.mft.html

On Sat, Dec 31, 2022 at 07:40:54PM -0800, Randy Bush wrote:
> From: PacketVis <notifications at packetvis.com>
> Subject: bgp ta-malfunction - low severity - PacketVis
> 
> Possible TA malfunction: 77.33% of the ROAs disappeared from AFRINIC.
> 
> See more details about the event:
> https://packetvis.com/#/bgp/event/2a35a5824772ae3b651293ec5d9b6367-37572a3c-b445-4075-9741-a419b516ca36/6d742c0ae811df9c41ab427a8ac09e07a93388c7
> 
> _______________________________________________
> Sidrops mailing list
> Sidrops at ietf.org
> https://www.ietf.org/mailman/listinfo/sidrops


