[devel] [PATCH 2/2] gb: optimize rebuilt srpm if its identity is equal to identity of srpm in the repo

Fri Apr 17 00:51:51 MSK 2020

On Tue, Apr 14, 2020 at 7:42 PM Vladimir D. Seleznev
<vseleznv at altlinux.org> wrote:
> > Then suppose I build a gearified package from Sisyphus for p9. But
> > your code only handles GB_REPO_DIR, not the NEIGHBOUR_REPO_DIR the
> > package comes from. To be clear, that information is lost: when you
> > request to build a signed tag from /gears, it does not imply that
> > there is a corresponding .src.rpm in any REPO_DIR.
>
> That's a future part. I wrote some code that checks the uprepos, but I
> didn't like it. The correct way is to check the uprepos archives as well.

I gave it some more thought.

First, the way you're trying to hash all the unknown tags is
interesting, but I wouldn't do that. You may want to hash everything
if you don't understand the internal structure and prefer to treat the
input as a black box.  On the contrary, we understand the internals of
a package well. What's then a minimal subset of tags to identify a
package? For a source package, it's FileDigests + BuildRequires +
BuildConflicts. That's all! They determine the outcome of a build, and
we may reasonably postulate that the rest of the tags should not
influence that outcome. You don't even have to hash NEVR, because the
specfile is already in FileDigests. (But you should probably hash
FileFlags, because they point out which file is the specfile. You
should also hash FileModes, because some sources may be executable.
But that's about all.)
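
To make the source-package case concrete, here is a minimal sketch in
Python. It assumes the header has already been read into a plain dict of
tag name to list of values; that dict, the function name source_id and
the choice of SHA-256 are illustrative assumptions, not what the patch
does.

```python
import hashlib

# Tags assumed sufficient to identify a source package, per the text above.
SRC_ID_TAGS = ("FileDigests", "FileFlags", "FileModes",
               "BuildRequires", "BuildConflicts")

def source_id(header):
    """Hash only the tags that determine the outcome of a build.

    `header` is a hypothetical dict mapping tag names to lists of values;
    extracting it from the .src.rpm is deliberately left out.
    """
    h = hashlib.sha256()
    for tag in SRC_ID_TAGS:
        values = header.get(tag, [])
        # Record the tag name and value count so fields cannot run together.
        h.update(f"{tag}:{len(values)}\n".encode())
        for v in values:
            data = str(v).encode()
            # Length-prefix each value for the same reason.
            h.update(f"{len(data)}:".encode() + data + b"\n")
    return h.hexdigest()
```

With this, two headers that differ only in tags outside SRC_ID_TAGS (say,
the packager or the build time) get the same ID, while flipping the
executable bit in FileModes changes it.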

Second, referring to the discussion about hash functions, the hash
function you're using isn't all that important.  That's because you're
hashing MD5 sums in FileDigests, and those are the weakest link and
(theoretically) the main cause of any collision. The speed isn't
important either, because you're hashing relatively short inputs.

So what's the right set of tags for a binary package, and what is its
identity? (I'm not sure identity is the right word, I would rather
call it ID. Identity is who you are and what you believe in, for
example a black person who votes for Obama.)  I've already hinted that
identity can be defined via substitution: if you replace a package
with a different package that has the same identity, there should be no
functional difference, and furthermore no difference "for all intents
and purposes", except for a few observable differences which we deem
immaterial and permit explicitly, such as FileMtimes. So obviously you
need to hash at least FileDigests and
Requires/Provides/Obsoletes/Conflicts. This should satisfy the
definition of ID for rpm (the dependencies are satisfied in the same
way, and file conflicts are resolved in the same way, so rpm can't
tell the difference if we make a substitution.)

It isn't clear whether you should hash informational sections such as
%description. It can be argued that under the same NEVR, the
description shouldn't change anyway. Is it possible that nothing
changes in a package but the description? Would we still want to
update/replace the package then?

Finally, your identity hash need not be fixed once and for all. It
is used only for internal bookkeeping, so once in a while you are
allowed to change the hash and rebuild the identity-addressable
storage. You should have a script for that in girar/admin. It may take
an hour or so to complete, but that's not too bad.
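
Such a rebuild script could be as simple as the following sketch, which
applies to the source-package case where a hardlink is enough. The flat
directory layout, the function name rebuild_id_store and the compute_id
callback are assumptions made for illustration; this is not girar's
actual layout.

```python
import os

def rebuild_id_store(cas_dir, ids_dir, compute_id):
    """Hardlink every file in cas_dir into ids_dir under its identity.

    compute_id is a caller-supplied function (path -> hex string), so the
    hash can be changed and the store rebuilt into a fresh directory.
    Both directories must live on the same filesystem for hardlinks.
    """
    os.makedirs(ids_dir, exist_ok=True)
    linked = 0
    for name in sorted(os.listdir(cas_dir)):
        src = os.path.join(cas_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(ids_dir, compute_id(src))
        try:
            os.link(src, dst)    # atomic; fails if this ID already exists
            linked += 1
        except FileExistsError:
            pass                 # an equivalent package was already seen
    return linked
```

After changing the hash, one would build the new store in a separate
directory and swap it in with a rename once the run completes.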

> > There is already a problem with cross-repo copying: if done in
> > earnest, both repos need to be locked. And of course this is
> > deadlock-prone. You can do better without any locking if you identify
> > every package in all repos with your new identity hash. This can be
> > done relatively easily, since you already have that big
> > content-addressable storage. You can hardlink it into a shadow
> > identity-addressable storage. Once you've done that, you obtain the
> > global / beatific vision: given a package, you instantly know if you
> > have already seen something like this. (On second thought: you
> > don't need locking because the -f test is atomic and files cannot be
> > removed from the storage, but there will still be race conditions.
> > It's not too bad in practice. Further, those race conditions can be
> > detected at the task-commit stage.)
>
> I like the idea, but there are some issues with this solution: these
> *are* collisions. I explain this below, but this idea will work
> perfectly with sourcerpms.
>
> The problem is that if we want to handle binary rpms as well, there will
> be collisions of a kind by design. For example, package foo has two
> subpackages: foo-data and libfoo. After a rebuild of foo, foo-data has
> the same identity as the previous foo-data build, but libfoo now has a
> different one. According to the plan, the whole rebuild has significant
> changes, and all binary packages should be replaced with the new ones.
> And now we have two foo-data packages with the same identity value, but
> they belong to different builds.
>
> > There is one specific problem with the outlined approach: the notion
> > of identity is flawed, because the disttag may or may not matter.
> > Sometimes you cannot substitute a package for another package with the
> > same identity but a different disttag. Specifically this is the case
> > with strict dependencies between subpackages. You cannot substitute a
> > subpackage unless you also substitute all the other subpackages.
>
> Yes, that is correct, I considered this.

So for src.rpm packages, it's a solved problem. For binary packages,
the identity should specifically exclude disttag. It will no longer
satisfy the definition of ID for rpm (substitution will break for
subpackages with strict dependencies). Therefore for binary packages,
we need to track <ID,disttag> tuples. This is a one-to-many relation:
for each ID, there may be a few disttags.  So for binary packages we
need a separate identity-addressable storage which maps ID to
<disttag,filehash> (while for source packages, a hardlink maps ID to
filehash).  If implemented naively, this will create many small files,
one file per ID, most files with just one line. In a more practical
implementation, you should probably group all those small files by
package name.  So you'll have:

$ cat id2f/libfoo
<libfoo-ID1> <disttag1> <libfoo-filehash1>
<libfoo-ID2> <disttag2> <libfoo-filehash2>

$ cat id2f/foo-data
<foo-data-ID1> <disttag1> <foo-data-filehash1>
<foo-data-ID1> <disttag2> <foo-data-filehash2>

Note that for libfoo, the IDs are different, but with foo-data the IDs
are the same. This indicates that the contents of libfoo have changed
after a rebuild, while the contents of foo-data have not.
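
Reading such a per-name file back into the one-to-many relation is
straightforward. A minimal sketch, assuming the three-column
whitespace-separated format shown above (the function name parse_id2f is
hypothetical):

```python
from collections import defaultdict

def parse_id2f(text):
    """Parse '<ID> <disttag> <filehash>' lines into
    {ID: [(disttag, filehash), ...]}, keeping the file's line order."""
    table = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue                      # tolerate blank lines
        pkg_id, disttag, filehash = line.split()
        table[pkg_id].append((disttag, filehash))
    return dict(table)
```

For the foo-data file above, the result has a single ID with two
<disttag,filehash> entries; for libfoo, two IDs with one entry each.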

Suppose you have such a store, and foo.src.rpm is getting rebuilt
again (or copied to p9). You can then consult the store and see
if the outcome can be replaced with either libfoo-filehash1 +
foo-data-filehash1 (with disttag1) or libfoo-filehash2 +
foo-data-filehash2 (with disttag2), but not in other combinations.
You'll need an elaborate algorithm which coordinates substitutions
across architectures, but this seems doable.
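
For a single architecture, the "all subpackages under one disttag" rule
can be sketched as below. The in-memory store layout and the function
name find_substitution are assumptions for illustration, and the
coordination across architectures is deliberately left out.

```python
def find_substitution(store, built_ids):
    """Return (disttag, {name: filehash}) for a disttag under which every
    freshly built subpackage has a stored file with the same ID, or None.

    store:     {name: [(pkg_id, disttag, filehash), ...]}  (hypothetical)
    built_ids: {name: pkg_id} for the subpackages of the fresh build
    """
    candidates = None
    per_pkg = {}
    for name, pkg_id in built_ids.items():
        # Stored files of this subpackage with a matching ID, keyed by disttag.
        matches = {d: f for (i, d, f) in store.get(name, []) if i == pkg_id}
        per_pkg[name] = matches
        # Keep only the disttags that work for every subpackage so far.
        candidates = set(matches) if candidates is None else candidates & set(matches)
    if not candidates:
        return None
    disttag = sorted(candidates)[0]       # arbitrary but deterministic choice
    return disttag, {name: per_pkg[name][disttag] for name in built_ids}
```

Mixed combinations such as libfoo from one disttag and foo-data from
another are never returned, because the candidate set is intersected
across all subpackages.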