[devel] hash collision in rpm

Alexey Tourbin alexey.tourbin на gmail.com
Вт Май 4 15:50:00 MSK 2021


On Tue, May 4, 2021 at 1:45 PM Alexey Gladkov <legion на altlinux.ru> wrote:
> On Tue, May 04, 2021 at 01:32:59PM +0300, Anton Farygin wrote:
> > On 01.05.2021 12:04, Alexey Gladkov wrote:
> > > On Sat, May 01, 2021 at 06:44:31AM +0000, ALT beekeeper wrote:
> > > > Package: rust-1:1.50.0-alt1
> > > > Status: Sisyphus/x86_64 test rebuild failed
> > <skip>
> > > Меня немного настораживает такое обилие предупреждений о коллизии. Как бы
> > > не было беды, когда кто-то решит собрать модули rust отдельно.
> > >
> > Я тоже с таким же столкнулся в новом пакете с библиотеками для C++:
> >
> > warning: hash collision: _ZN23IGESData_IGESReaderData12AddStartLineEPKc
> > _ZNK14IGESSolid_Loop4EdgeEi
> >
> > и там такого довольно много.
>
> Я обсудил это с Димой и он объяснил мне, что это не признак ошибки.
> Вот тут в suggest_bpp.c есть комментарий:
>
> http://git.altlinux.org/gears/r/..git?p=rpm-build.git;a=commitdiff;h=0ea2deff
>
> Если мы выбираем bpp таким образом, чтобы вероятность коллизии была
> 1/1024, то в больших библиотеках (у rust там 16215 symbols) ожидаются
> коллизии, это нормально.

The number of birthday collisions among n-bit hashes is small only if
we hash up to about sqrt(2^n) items. If we want to keep the number of
birthday collisions under control, then bpp=10+log2(n) is simply a bad
formula. A better one is bpp=max(10,log2(n))+log2(n). I.e. for
n=16215, as per your example, it should be 14+14 rather than 10+14.

Now, there is a subtle argument why you shouldn't go 14+14, and why
10+14 is indeed "normal". One view is that we should try to keep the
number of birthday collisions small *per a provided version*.  But
this view is naive. Consider what happens when we have one big shared
library and we split it into two smaller libraries. It seems that
smaller libraries can use smaller bpp, e.g. 13+13 each instead of
14+14 combined. Indeed the number of expected collisions E[X] per each
library is the same. But now we've got two libraries, and the combined
expectation is linear! From the perspective of a client, one OR the
other library is now twice as likely to exhibit a collision, simply
because there are two of them!

You could achieve the same effect - doubling E[X] - without splitting
a big library into two smaller libraries. Just use the big library
with bpp=13 instead of 14. :-)

So the right view must be under the invariant "it doesn't matter how
the symbols are spread among libraries". That's why "probability per
symbol" makes sense, and "restricting birthday collisions" does not.


Подробная информация о списке рассылки Devel