[devel] RFC: girar: optimize rebuild

Andrey Savchenko bircoph at altlinux.org
Tue Apr 14 17:57:13 MSK 2020


On Sun, 12 Apr 2020 02:31:43 +0300 Alexey V. Vissarionov wrote:
> On 2020-04-11 13:36:31 +0300, Andrey Savchenko wrote:
> 
>  >> The first part of rebuilt packages optimization for girar.
>  >> It introduces pkg_identity() and simple optimization of the
>  >> rebuilt sourcerpm.
>  >> pkg_identity() takes RPM package and returns a value called
>  >> package identity, a hash of subset of RPM package header.
>  >> That subset is the entire header without some nonessential
>  >> artifacts like buildhost, buildtime, header hashsum, etc.
>  > I see two problems with proposed approach:
>  > 1) It assumes there will be no pkg_identity hash collisions.
>  > This is wrong. They may occur sooner or later and the code
>  > *must* correctly deal with such collisions.
> 
> The solution is well known: prefix the hash with a time_t value
> to let it grow monotonically while still being strictly dependent
> on sensitive data.

Yes, this is a good idea.
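
A minimal sketch of such a timestamp-prefixed identity, assuming the
header is already available as a mapping of tag names to byte values.
The tag names in VOLATILE_TAGS, the source of the time_t value and the
function signature are illustrative assumptions, not girar's actual
code; BLAKE2b is used only as an example digest.

```python
import hashlib
import struct

# Hypothetical set of header tags that do not affect package identity.
VOLATILE_TAGS = frozenset({"BUILDHOST", "BUILDTIME", "SHA1HEADER"})


def pkg_identity(header, timestamp):
    """Return '<time_t hex>-<digest hex>' for a header given as {tag: bytes}."""
    h = hashlib.blake2b(digest_size=32)
    for tag in sorted(header):
        if tag in VOLATILE_TAGS:
            continue
        value = header[tag]
        # Length-prefix both fields so adjacent values cannot run together.
        h.update(struct.pack(">I", len(tag)) + tag.encode("ascii"))
        h.update(struct.pack(">I", len(value)) + value)
    # Fixed-width hex keeps lexicographic order equal to chronological order.
    return "%016x-%s" % (timestamp, h.hexdigest())
```

Two headers that differ only in a volatile tag get the same digest part,
while the prefix orders identities by time. Which timestamp to feed in
is left open here; the caller has to choose one.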
 
> Should we face a hash collision, we could check whether the
> timestamps differ significantly.
> 
>  > 2) The hash function choice, sha256, is very unfortunate:
>  > it has longer digest than sha1, but otherwise is vulnerable
>  > to the same attack; so right now it is still marginally secure,
>  > but it will not last long.
> We don't really need any cryptographic-grade hash function here:
> all we need is just a checksum with a good distribution to detect
> whether something had changed - obviously enough, nobody would
> try to build and exploit collisions here. That said, we can use
> almost any polynomial.

Still, it may be a security issue. Consider what happens if the
wrong source rpm is used: new modifications, including security
fixes, may be silently omitted from a branch.
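
One way to keep a collision from ever selecting the wrong source rpm is
to treat the identity as a lookup key only and compare the hashed bytes
before reusing anything. A rough sketch, assuming the filtered header
is available as bytes; the IdentityCache class and its methods are
hypothetical names, not part of girar.

```python
import hashlib


def stable_digest(stable_header):
    """Digest of the already-filtered header bytes (BLAKE2b, 32 bytes)."""
    return hashlib.blake2b(stable_header, digest_size=32).hexdigest()


class IdentityCache:
    """Maps identity -> list of (stable_header, srpm_path) candidates."""

    def __init__(self, digest=stable_digest):
        self._digest = digest
        self._entries = {}

    def add(self, stable_header, srpm_path):
        key = self._digest(stable_header)
        self._entries.setdefault(key, []).append((stable_header, srpm_path))

    def lookup(self, stable_header):
        """Return a cached path only if the header bytes match exactly."""
        key = self._digest(stable_header)
        for cached_header, path in self._entries.get(key, []):
            if cached_header == stable_header:
                return path
        return None  # no match, or a mere hash collision: rebuild
```

With this layout a collision costs one extra rebuild instead of a
silently reused package, whatever hash function is chosen.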

>  > Moreover sha256 is quite slow.
> 
> SHA2 is implemented in the hardware in some modern CPUs, so it's
> quite fast there.

Only in some, and only on the amd64 arch. But our main build
infrastructure also uses ppc64le and aarch64, so it is very important
to be efficient, especially on aarch64, which is the bottleneck for
most tasks. And consider that we have secondary build systems for
other arches such as mips, riscv and e2k.

Talk is cheap, so let's see some numbers.

0) dd if=/dev/urandom of=/tmp/test.file bs=1M count=2048

1) Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
$ time sha256sum -b /tmp/test.file
8.67user 0.27system 0:08.94elapsed 99%CPU (0avgtext+0avgdata 1944maxresident)k
8.70user 0.25system 0:08.96elapsed 99%CPU (0avgtext+0avgdata 2148maxresident)k
8.65user 0.28system 0:08.93elapsed 99%CPU (0avgtext+0avgdata 2064maxresident)k

$ time b2sum -b /tmp/test.file
2.48user 0.32system 0:02.81elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
2.46user 0.30system 0:02.76elapsed 99%CPU (0avgtext+0avgdata 2120maxresident)k
2.47user 0.29system 0:02.77elapsed 99%CPU (0avgtext+0avgdata 2068maxresident)k

2) E8C (1300 MHz,  MBE8C-PC v.2)
$ time sha256sum -b /tmp/test.file
11.69user 0.93system 0:12.64elapsed 99%CPU (0avgtext+0avgdata 3784maxresident)k
11.78user 0.85system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3836maxresident)k
11.72user 0.90system 0:12.63elapsed 99%CPU (0avgtext+0avgdata 3956maxresident)k

$ time b2sum -b /tmp/test.file
6.90user 1.37system 0:08.27elapsed 99%CPU (0avgtext+0avgdata 3896maxresident)k
6.76user 1.10system 0:07.87elapsed 99%CPU (0avgtext+0avgdata 3844maxresident)k
6.93user 0.95system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 3872maxresident)k
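
To put the runs above side by side, here is a small script that
averages the elapsed times and derives throughput and speedup. The
numbers are copied from the output above; the script itself is only an
illustration.

```python
FILE_MIB = 2048  # bs=1M count=2048

# Elapsed wall-clock seconds copied from the runs above.
elapsed = {
    ("i7-4790", "sha256sum"): [8.94, 8.96, 8.93],
    ("i7-4790", "b2sum"): [2.81, 2.76, 2.77],
    ("E8C", "sha256sum"): [12.64, 12.63, 12.63],
    ("E8C", "b2sum"): [8.27, 7.87, 7.88],
}


def mean(values):
    return sum(values) / len(values)


def speedup(cpu):
    """How many times faster b2sum ran than sha256sum on the given CPU."""
    return mean(elapsed[(cpu, "sha256sum")]) / mean(elapsed[(cpu, "b2sum")])


for (cpu, tool), runs in elapsed.items():
    print("%-8s %-10s %6.1f MiB/s" % (cpu, tool, FILE_MIB / mean(runs)))
for cpu in ("i7-4790", "E8C"):
    print("%-8s b2sum is %.1fx faster" % (cpu, speedup(cpu)))
```

By elapsed time, b2sum comes out about 3.2x faster on the i7-4790 and
about 1.6x faster on the E8C.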

I see no reason to use the slower and less secure sha256 algorithm.
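
For anyone who wants to repeat the comparison on other build arches, a
rough in-memory benchmark could look like the sketch below. It uses
Python's hashlib rather than the coreutils tools, so absolute numbers
will differ from the ones above; the buffer size and repeat count are
arbitrary choices.

```python
import hashlib
import os
import time


def throughput(name, data, repeats=3):
    """Best-of-N hashing speed in MiB/s for a hashlib algorithm name."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return len(data) / (1024 * 1024) / max(best, 1e-9)


if __name__ == "__main__":
    data = os.urandom(64 * 1024 * 1024)  # 64 MiB, kept in memory
    for name in ("sha256", "blake2b"):
        print("%-8s %8.1f MiB/s" % (name, throughput(name, data)))
```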

Best regards,
Andrew Savchenko