[devel] hash collision in rpm
Alexey Tourbin
alexey.tourbin at gmail.com
Wed May 5 08:40:27 MSK 2021
On Tue, May 4, 2021 at 3:50 PM Alexey Tourbin <alexey.tourbin at gmail.com> wrote:
> On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov <legion at altlinux.ru> wrote:
> > If we choose bpp so that the collision probability is 1/1024, then
> > in large libraries (the rust library there has 16215 symbols)
> > collisions are expected; that's normal.
>
> The number of birthday collisions among n-bit hashes is small only if
> we hash up to about sqrt(2^n) items. If we want to keep the number of
> birthday collisions under control, then bpp=10+log2(n) is simply a bad
> formula. A better one is bpp=max(10,log2(n))+log2(n). I.e. for
> n=16215, as per your example, it should be 14+14 rather than 10+14.
>
> Now, there is a subtle argument why you shouldn't go 14+14, and why
> 10+14 is indeed "normal". One view is that we should try to keep the
> number of birthday collisions small *per provided version*. But
> this view is naive. Consider what happens when we have one big shared
> library and we split it into two smaller libraries. It seems that
> smaller libraries can use smaller bpp, e.g. 13+13 each instead of
> 14+14 combined. Indeed the number of expected collisions E[X] per each
> library is the same. But now we've got two libraries, and the combined
> expectation is linear! From the perspective of a client, one OR the
> other library is now twice as likely to exhibit a collision, simply
> because there are two of them!
> So the right view must be under the invariant "it doesn't matter how
> the symbols are spread among libraries". That's why "probability per
> symbol" makes sense, and "restricting birthday collisions" does not.
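A quick numeric sanity check of the two formulas, using the standard birthday approximation E[X] ~= n^2 / 2^(bpp+1) (derived further below; the function name here is mine, and the numbers follow the rust example):

```python
import math

def expected_collisions(n, bpp):
    """Expected number of birthday collisions among n uniform
    bpp-bit hash values: E[X] ~= n^2 / 2^(bpp+1)."""
    return n * n / 2 ** (bpp + 1)

n = 16215                        # symbols in the rust example
lg = math.ceil(math.log2(n))     # 14
bpp_old = 10 + lg                # 10+14 = 24, the current formula
bpp_new = max(10, lg) + lg       # 14+14 = 28, the proposed one

print(expected_collisions(n, bpp_old))   # ~7.8 expected collisions
print(expected_collisions(n, bpp_new))   # ~0.49
```

So with bpp=10+log2(n) a library of that size is expected to contain several colliding pairs, while bpp=14+14 keeps the expectation below one.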
Let's check that the invariant holds with bpp=10+log2(n). Again, we
split a long set-version with bpp=10+14 into two shorter set-versions
with bpp=10+13 (the number of symbols being 2^14 resp. 2^13). The
number of hash values over the hash space, which is how we define the
false positive rate, is the same: 2^14/2^24=2^13/2^23=2^-10.
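The invariance of the false positive rate is a one-line arithmetic check:

```python
# False positive rate = number of hash values / size of the hash space.
rate_long  = 2**14 / 2**(10 + 14)   # one long set-version,  bpp = 10+14
rate_short = 2**13 / 2**(10 + 13)   # each short set-version, bpp = 10+13
assert rate_long == rate_short == 2**-10
print(rate_long)  # 0.0009765625
```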
A related notion is the average distance between two neighbouring hash
values (when we "drop" the hash values into the interval [0,2^bpp)).
That distance is approximately 2^bpp/n = 2^10 in both cases, and it
determines the optimal parameter m for Golomb-Rice code. Roughly speaking, with
bpp=10+log2(n), the optimal m is 10. With optimal m, the length of
Golomb-Rice code is approximately m+2 bits per integer. So the
combined length of the two shorter versions must be about the same as
the length of the long version.
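The m+2 estimate can be checked on random data (a sketch using plain Rice coding with m = 10, per the parameters above; helper names are mine):

```python
import random

def rice_len(d, m):
    """Length in bits of the Rice code for delta d with parameter m:
    unary quotient (d >> m), one stop bit, m remainder bits."""
    return (d >> m) + 1 + m

# A hypothetical set-version: n random bpp-bit hash values; we encode
# the deltas between sorted neighbours with m = bpp - log2(n) = 10.
random.seed(0)
bpp, n, m = 24, 2**14, 10
values = sorted(random.sample(range(2**bpp), n))
deltas = [b - a for a, b in zip(values, values[1:])]
bits_per_delta = sum(rice_len(d, m) for d in deltas) / len(deltas)
print(round(bits_per_delta, 2))  # close to m + 2 = 12
```

(For geometrically distributed deltas with mean 2^m the exact expected length is m + 1 + 1/(e-1) ~= m + 1.58, which is where the "approximately m+2 bits" rule of thumb comes from.)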
Finally we need to check the birthday collision rate. With bpp-bit
hash values, n of them, the expected number of birthday collisions is
(to a good approximation)
E[X]=n^2 / 2^(bpp+1)
(refer to Cormen et al., Introduction to Algorithms (3rd ed.), p. 133.)
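The approximation is easy to verify empirically (a quick Monte Carlo sketch; the sizes here are arbitrary):

```python
import random

def count_collisions(n, bpp, trials=200):
    """Average number of colliding pairs among n random bpp-bit values."""
    total = 0
    for _ in range(trials):
        seen = {}
        for _ in range(n):
            h = random.getrandbits(bpp)
            seen[h] = seen.get(h, 0) + 1
        # c values in one bucket make c*(c-1)/2 colliding pairs
        total += sum(c * (c - 1) // 2 for c in seen.values())
    return total / trials

random.seed(1)
n, bpp = 1000, 16
print(count_collisions(n, bpp))   # observed average
print(n * n / 2 ** (bpp + 1))     # E[X] = n^2 / 2^(bpp+1) ~= 7.6
```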
Let E_0 be the expected number of collisions for a long version. For a
shorter version with (bpp-1)-bit hash values, n/2 of them, we have
E_1 = (n/2)^2 / 2^((bpp-1)+1) = E_0 / 2
So the expected number of collisions for short versions is smaller,
but we have two of them. By the linearity of expectation, the birthday
collision rate is the same. (This generally shows that it is pointless
to split a long set-version into a few shorter ones, under any
equivalence classes on symbols such as ELF ABI tags. Counterintuitively,
you can't fight birthday collisions by partitioning, unless you also
increase the output length. So doing separate set-versions for
libfoo.so.0(FOO_1.0) and libfoo.so.0(FOO_1.1) is not necessarily a
good idea; we can just as well hash sym@FOO_1.0 and sym@FOO_1.1 and
package them into a single version.)
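The splitting argument in one small calculation (the helper name is mine):

```python
def expected_collisions(n, bpp):
    # E[X] ~= n^2 / 2^(bpp+1), as derived above
    return n * n / 2 ** (bpp + 1)

n, bpp = 2**14, 24
E0 = expected_collisions(n, bpp)             # one long set-version
E1 = expected_collisions(n // 2, bpp - 1)    # each of two short ones
assert E1 == E0 / 2
assert 2 * E1 == E0   # two short versions collide as often, combined
print(E0)  # 8.0
```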
* * *
By the way, there's a clever way to detect collisions at a higher
level, even though low-level collisions are unavoidable, subject to
implacable math. We can take advantage of the fact that the build
system synchronizes package builds across a few architectures.
Currently, if a missing symbol goes undetected due to a hash
collision, it does so on all architectures simultaneously. To change
that, we need to hash symbols differently, depending on the target
architecture (we can pass seed=hash(arch) to the hash function, or use
different initialization vectors IV=hash(arch)). The desired outcome
is that when a missing symbol goes undetected, it does so, with an
overwhelming probability, on only one target. Roughly speaking, if
the probability of an undetected missing symbol is 10^-3 on a single
target, it must be 10^-6 on two targets, 10^-9 on three targets, etc.
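A sketch of arch-dependent seeding (blake2b with a per-arch salt stands in for whatever hash set-versions actually use; all names here are illustrative):

```python
import hashlib

def symbol_hash(sym, arch, bpp=24):
    """bpp-bit hash of a symbol, seeded per target architecture.
    A sketch only: the point is that the arch enters the hash as a
    salt/IV, so collisions on different arches are independent."""
    h = hashlib.blake2b(sym.encode(), salt=arch.encode()[:16],
                        digest_size=8)
    return int.from_bytes(h.digest(), 'little') % 2 ** bpp

# The same pair of symbols now collides (or not) independently per arch:
for arch in ('x86_64', 'aarch64', 'ppc64le'):
    print(arch, symbol_hash('foo@FOO_1.0', arch))
```

Since the per-arch hash values are independent, a given colliding pair on one arch almost certainly does not collide on another, which is what multiplies the undetected-miss probabilities across targets.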