[devel] stopping a cascade of rebuilds

Thu Apr 23 22:21:28 MSK 2020

On Mon, Apr 20, 2020 at 12:05:26PM +0300, Alexey Tourbin wrote:
> On Fri, Apr 17, 2020 at 4:54 PM Dmitry V. Levin <ldv at altlinux.org> wrote:
> > > So what's the right set of tags for a binary package, and what is its
> > > identity? (I'm not sure identity is the right word, I would rather
> > > call it ID. Identity is who you are and what you believe in, for
> > > example a black person who votes for Obama.)  I've already hinted that
> > > identity can be defined via substitution: if you replace a package
> > > with a different package but the same identity, there should be no
> > > functional difference, and furthermore no difference "for all intents
> > > and purposes", except for a few observable differences which we deem
> > > immaterial and permit explicitly, such as FileMtimes. So obviously you
> > > need to hash at least FileDigests and
> > > Requires/Provides/Obsoletes/Conflicts. This should satisfy the
> > > definition of ID for rpm (the dependencies are satisfied in the same
> > > way, and file conflicts are resolved in the same way, so rpm can't
> > > tell the difference if we make a substitution.)
> >
> > What about various types of scripts?
> >
> > $ rpmquery --querytags | grep 'PROG$'
> > PREINPROG
> > POSTINPROG
> > PREUNPROG
> > POSTUNPROG
> > VERIFYSCRIPTPROG
> > TRIGGERSCRIPTPROG

So your position is that we control and understand our packages, and
we can explicitly define which tags are essential for the task. Indeed,
my idea was to filter out all unrelated tags, because a future rpm-build
may introduce new tags, but that seems like needless overcaution. You
are right that we can always add more tags to the identity calculation.
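
To make the explicit-list approach concrete, here is a minimal sketch in
Python. It is only an illustration: the header is modelled as a plain
dict of tag name to list of values (not the real rpm bindings), the tag
selection in IDENTITY_TAGS is an assumption based on the tags mentioned
in this thread, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

# Assumed whitelist, taken from the tags discussed in this thread.
# A real list would need more (scriptlet bodies, triggers, file modes...).
IDENTITY_TAGS = (
    "FILEDIGESTS",
    "REQUIRENAME", "PROVIDENAME", "OBSOLETENAME", "CONFLICTNAME",
    "PREINPROG", "POSTINPROG", "PREUNPROG", "POSTUNPROG",
)

def package_identity(header):
    """Hash only the whitelisted tags of a header given as a plain dict."""
    h = hashlib.sha256()
    for tag in IDENTITY_TAGS:
        values = header.get(tag, ())
        # Tag name and value count delimit each group unambiguously.
        h.update(f"{tag}:{len(values)}\n".encode())
        for value in values:
            data = str(value).encode()
            # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
            h.update(f"{len(data)}:".encode() + data)
    return h.hexdigest()
```

Tags outside the list (FileMtimes, disttag, buildtime) do not affect
the result, so adding a tag later is a one-line change to the tuple.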

> Sure, slipped my mind.

> So for src.rpm packages, it's a solved problem. For binary packages,
> the identity should specifically exclude disttag. It will no longer
> satisfy the definition of ID for rpm (substitution will break for
> subpackages with strict dependencies). Therefore for binary packages,
> we need to track <ID,disttag> tuples.

Why should we track them? If we rebuild a package, we should check
whether the identity of its binary packages has changed. If it has not,
we shouldn't replace its binary packages with the rebuilt ones. It's
that simple.

Things get more complicated when packages are "copied". In that case
this scheme could help. But we also need to track buildtime, just to
prevent a package from being replaced with an earlier build. So it is a
triple now.

> This is a one-to-many relation: for each ID, there may be a few
> disttags.  So for binary packages we need a separate
> identity-addressable storage which maps ID to <disttag,filehash>
> (while for source packages, a hardlink maps ID to filehash).  If
> implemented naively, this will create many small files, one file per
> ID, most files with just one line. In a more practical implementation,
> you should probably group all those small files by package name.  So
> you'll have:
> 
> $ cat id2f/libfoo
> <libfoo-ID1> <disttag1> <libfoo-filehash1>
> <libfoo-ID2> <disttag2> <libfoo-filehash2>
> 
> $ cat id2f/foo-data
> <foo-data-ID1> <disttag1> <foo-data-filehash1>
> <foo-data-ID1> <disttag2> <foo-data-filehash2>
>
> Note that for libfoo, the IDs are different, but with foo-data the IDs
> are the same. This indicates that the contents of libfoo have changed
> after a rebuild, while the contents of foo-data have not.

That is correct.
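
As a rough sketch of how the quoted id2f files could be read, assuming
exactly the three whitespace-separated columns shown above (ID, disttag,
filehash) and nothing else, the following Python groups the lines by ID.
The function names are made up for illustration.

```python
from collections import defaultdict

def parse_id2f(text):
    """Map each ID to the list of (disttag, filehash) pairs recorded for it."""
    by_id = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ident, disttag, filehash = line.split()
        by_id[ident].append((disttag, filehash))
    return dict(by_id)

def contents_changed(text):
    """True if the file records more than one distinct ID for the package."""
    return len(parse_id2f(text)) > 1
```

With the two example files above, contents_changed() would be true for
libfoo (two IDs) and false for foo-data (one ID with two disttags).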

> It may even make sense to group the mappings by src.rpm name instead
> of package name. At first it seems less intuitive, but in return it
> can give you a consistent view similar to MVCC snapshot.  Of course,
> these files should be updated atomically, with rename(2). To check a
> set of subpackages, you first need to copy the file to a local dir. This
> should rule out the case in which some subpackages have been added to
> the file and some not.

I don't get this idea, please expand it.
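
To make the question concrete, here is my tentative reading of the
proposal, sketched in Python. This is only a guess at what is meant:
one mapping file per src.rpm, replaced as a whole via rename(2), and
read only from a private copy. The function names and the file layout
are assumptions.

```python
import os
import shutil
import tempfile

def publish_mapping(path, lines):
    """Replace the mapping file at `path` as a whole, never in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".id2f-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("".join(line + "\n" for line in lines))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # rename(2) on POSIX: readers see old or new
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def snapshot_mapping(path, local_dir):
    """Copy the file to a local dir first, then read the private copy."""
    local = os.path.join(local_dir, os.path.basename(path))
    shutil.copyfile(path, local)
    with open(local) as f:
        return f.read().splitlines()
```

If that is the intent, then all subpackages of one src.rpm appear in the
file together or not at all, but I would like this confirmed.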

> These files are to be updated during the task-commit stage, under the
> exclusive lock. This is also the right moment to detect race
> conditions. Suppose you build the same package for sisyphus and p9 in
> parallel, and the build result is the same. Before adding new
> packages, you recheck if the whole set can be replaced with the
> already existing packages.  One of the two tasks then should fail (or
> be automatically scheduled for another iteration).

> These IDs can have another application. Suppose a task is resumed, and
> subtask #100 needs a rebuild. It is likely that, after the rebuild,
> the result won't change (all subpackages end up with the same IDs).
> However, we replace the packages anyway.  This will trigger the
> rebuild of subtask #200, if it depends on subtask #100. There are two
> approaches to spare the unnecessary rebuild:
> 
> 1) Don't overwrite local files with those from the remote build, if
> they are identical.
> 2) When tracking the contents of BuildRoot, take into account IDs
> instead of HeaderSHA1.

I like the idea.
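
The second approach could look roughly like the Python sketch below.
It assumes the BuildRoot contents are available as a mapping of package
name to ID; how the girar builder actually records BuildRoot contents is
not shown here, and both function names are hypothetical.

```python
import hashlib

def buildroot_fingerprint(installed):
    """Hash sorted (name, ID) pairs of the packages in a BuildRoot."""
    h = hashlib.sha256()
    for name, ident in sorted(installed.items()):
        h.update(f"{name}\0{ident}\n".encode())
    return h.hexdigest()

def needs_rebuild(previous_fingerprint, installed_now):
    """A subtask is rebuilt only if some ID in its BuildRoot changed."""
    return buildroot_fingerprint(installed_now) != previous_fingerprint
```

Since disttag and buildtime are not part of the ID, a rebuild of
subtask #100 that yields the same IDs leaves the fingerprint of
subtask #200 unchanged.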

> There is a peculiarity with the first approach: if subtask #100 has
> both arch and noarch subpackages, the decision to overwrite files
> cannot be made locally / per architecture. If you need to overwrite
> subpackages for at least one architecture, then you have to overwrite
> for all architectures, otherwise there will be unmet dependencies
> between arch and noarch subpackages.

That is correct.
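
In other words, the decision has to be taken once per subtask over all
architectures. A minimal sketch in Python, assuming the IDs are
collected as a nested mapping of architecture to subpackage name to ID
(the function name and this layout are made up for illustration):

```python
def must_overwrite_everywhere(old, new):
    """Decide once for the whole subtask whether to overwrite packages.

    `old` and `new` map architecture (including "noarch") to a mapping
    of subpackage name to ID.
    """
    if old.keys() != new.keys():
        return True  # an architecture appeared or disappeared
    # A change on any single architecture forces overwriting on all.
    return any(old[arch] != new[arch] for arch in new)
```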

> I think we already have this problem: the decision to rebuild a
> subtask is made locally (or rather remotely), per architecture. So
> occasionally we've seen unmet dependencies due to disttag mismatch.
> Has the problem been addressed?

Hmm, I'll think about that.

P.S.

> I'm not sure identity is the right word, I would rather call it ID.

It's just a term; we can define it as we want. There was another option,
to name it BuildID, but for some reason it was rejected.

-- 
   WBR,
   Vladimir D. Seleznev