Generating repository indexes
As the first step in adding repository hosting to Debusine, we need to be able to generate repository index files for debian:suite collections.
We should design the storage for these files in such a way as to make it easy to serve repositories directly from Debusine artifacts (as opposed to needing to write out the repositories to disk).
It is useful to serve historical snapshots as well as the current state of the archive, as long as Debusine still retains the necessary data under whatever retention policies are in effect. The collection data model already includes timestamps for when collection items were created or removed, and supports retaining items even after they are no longer active in their parent collection; as a result, repository indexes at a particular timestamp can be found by querying for collection items containing indexes that were not created after that timestamp and not removed before that timestamp.
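The snapshot rule above can be sketched as a filter over collection items. This is only an illustration: the IndexItem class and the indexes_at function are hypothetical stand-ins, not Debusine's actual models or query API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IndexItem:
    """Hypothetical stand-in for a collection item that holds an index file."""

    path: str
    created_at: datetime
    removed_at: Optional[datetime] = None


def indexes_at(items, timestamp):
    """Return items not created after `timestamp` and not removed before it."""
    return [
        item
        for item in items
        if item.created_at <= timestamp
        and (item.removed_at is None or item.removed_at >= timestamp)
    ]


jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
old = IndexItem("InRelease", created_at=jan, removed_at=feb)
new = IndexItem("InRelease", created_at=feb)
snapshot = indexes_at([old, new], datetime(2024, 1, 15, tzinfo=timezone.utc))
```

Under this literal reading, a timestamp exactly equal to a changeover matches both the old and the new item, so a real query would need to choose which side of the boundary is inclusive.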
To support race-free mirroring, index files are served via by-hash paths in addition to their base path. These paths are handled implicitly by the code that serves repositories, and are not recorded using separate collection items.
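As an illustration of how a by-hash path can be derived without a separate collection item, the sketch below follows the layout used by apt, where the file appears under by-hash/ALGORITHM/DIGEST in the same directory as its base path. The function name is hypothetical, and restricting it to SHA256 is an assumption made for brevity.

```python
import hashlib
import posixpath


def by_hash_path(path: str, content: bytes) -> str:
    """Return the by-hash path for an index file, next to its base path."""
    digest = hashlib.sha256(content).hexdigest()
    return posixpath.join(posixpath.dirname(path), "by-hash", "SHA256", digest)


example = by_hash_path("main/source/Sources.xz", b"example contents")
# example has the form "main/source/by-hash/SHA256/<64 hex digits>"
```

Because the path depends only on the file's contents and its base path, the serving code can compute it on demand.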
Category debian:repository-index
This artifact stores an index file in a repository: this covers any file that is not part of a source or binary package.
Data:
Files:
Relationships:
relates-to: for Release files, the other index files mentioned in them
extends: for Release.gpg and InRelease files, the corresponding unsigned Release file
Note that while in principle a Packages or Sources file relates to all the individual packages mentioned in it, actually creating those artifact relationships would result in a very large amount of churn in the ArtifactRelation table for relatively minimal benefit, so we intentionally skip those.
Changes to existing collections
The debian:suite and debian:archive collections gain the following:
Valid items:
debian:repository-index artifacts
Per-item data:
path: for index files, the path of the file relative to the root of the suite’s directory in dists (e.g. InRelease or main/source/Sources.xz)
Lookup names:
index:PATH: the current index file at PATH relative to the root of the suite’s directory in dists
debian:suite also gains:
Data:
indexes_generated_at: the transaction timestamp at which the GenerateSuiteIndexes task was most recently run
duplicate_architecture_all: if true, include Architecture: all packages in architecture-specific Packages indexes, and set No-Support-for-Architecture-all: Packages in the Release file; this may improve compatibility with older client code
Most index files are stored at the suite level: the code that serves an archive as a whole will look at the appropriate suite when serving paths under dists/SUITE/. There are a few exceptions for files that are not packages and not under dists/, such as override summaries and mirror traces; these are not consulted by apt and so implementing them is not urgent.
Overrides
Overrides are used by Debian’s traditional archive management software to store the component, section, and (for binary packages) priority of each package. (Ubuntu’s archive management software also supports phased update percentages, which are handled by overrides; other extensions are possible.)
While the name “override” might suggest that these are only applied where the values are something other than the ones supplied by the package, in fact every package in the archive has overrides even if those are equal to the ones supplied by the package. In Debian’s traditional archive management software, uploads of packages without overrides go into the NEW queue for manual review.
Review workflows of this kind are not yet in scope, but we already have per-item data for component, section, and priority in the debian:suite collection, which represent the common set of overrides and are considered when generating Packages and Sources files.
Debian also publishes override summaries in an indices directory; we treat these as just another kind of repository index file, although they are stored at the archive level rather than at the suite level.
GenerateSuiteIndexes task
This is a server-side task that generates Packages, Sources, and Release files (and their variants) for a debian:suite collection.
The task_data for this task may contain the following keys:
suite_collection (Single lookup, required): the debian:suite collection to operate on
generate_at (datetime, required): generate indexes for packages that were in the suite at this timestamp
The task searches for packages that were in the suite at the given time, builds the appropriate index files from them (setting Date in the Release file to the same timestamp as generate_at), and adds them to the collection. It sets the created_at fields of the new collection items to generate_at. If any index files in the collection are not the most recent ones (which may include the ones that were just created!), then it sets their removed_at to the timestamp when the next-newest index files were created.
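The removed_at bookkeeping described above can be sketched like this. The IndexItem class and the function name are hypothetical. Treating every distinct created_at value as one generation of index files is an assumption about how "next-newest" is determined.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IndexItem:
    """Hypothetical stand-in for a collection item that holds an index file."""

    path: str
    created_at: datetime
    removed_at: Optional[datetime] = None


def supersede_older_indexes(items):
    """Set removed_at on every item that has a newer generation after it."""
    # Each distinct created_at value is treated as one generation (assumption).
    generations = sorted({item.created_at for item in items})
    for item in items:
        next_pos = bisect_right(generations, item.created_at)
        if next_pos < len(generations):
            item.removed_at = generations[next_pos]


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 2, 1, tzinfo=timezone.utc)
t3 = datetime(2024, 3, 1, tzinfo=timezone.utc)
first = IndexItem("Release", created_at=t1)
latest = IndexItem("Release", created_at=t3)
# Indexes generated for a historical timestamp are superseded immediately.
backfilled = IndexItem("Release", created_at=t2)
supersede_older_indexes([first, latest, backfilled])
```

In this example the backfilled item is created between two existing generations, so it receives a removed_at value straight away, which is the case the text flags with "which may include the ones that were just created".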
Workflow update_suites
This workflow does whatever is needed to coordinate metadata updates for all archives in a workspace. Initially this will just involve generating basic indexes for each of the suites in those archives that have been changed, but later it may also generate supplementary files such as Contents-*.
The workflow operates on the workspace in which it was created.
task_data:
force_basic_indexes (boolean, defaults to False): if True, regenerate basic indexes (Packages, Sources, Release, and their variants) even if the state of the archive does not seem to have changed since they were last generated
For each suite in the workspace, if any collection items have been added or removed since its current indexes_generated_at value, it creates a GenerateSuiteIndexes task, with task data as follows:
suite_collection: the suite whose indexes should be generated
generate_at: the transaction timestamp at which the workflow orchestrator is being run (this needs special care to preserve idempotency, since any later runs of the same workflow orchestrator would have a different transaction timestamp)
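The per-suite decision can be sketched as a simple predicate. The Item class and the function name are hypothetical. Treating a missing indexes_generated_at as "never generated" is an assumption, and the sketch does not address the idempotency concern around generate_at.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for a collection item in a suite."""

    created_at: datetime
    removed_at: Optional[datetime] = None


def needs_basic_indexes(items, indexes_generated_at, force_basic_indexes=False):
    """Decide whether a GenerateSuiteIndexes task should be created."""
    if force_basic_indexes or indexes_generated_at is None:
        return True
    # Regenerate if anything was added or removed since the last generation.
    return any(
        item.created_at > indexes_generated_at
        or (item.removed_at is not None and item.removed_at > indexes_generated_at)
        for item in items
    )


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 3, tzinfo=timezone.utc)
unchanged = needs_basic_indexes([Item(t0)], indexes_generated_at=t1)
```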
To begin with, this workflow can just be run periodically, although that will not scale well to large numbers of suites. We should eventually have a mechanism where changes to a collection can trigger a workflow.
Todo
Future work is needed to sign Release files, and will need to be integrated into this task. That will probably involve creating a new sub-workflow that can deal with generating and signing indexes for a single suite. Once archives have been implemented, we may also want to group all the updates for suites in a given archive under a single sub-workflow.
Future work
This blueprint only covers the bare minimum needed to generate valid repository indexes, but real repositories often use more complex features. For instance:
Valid-Until can be supported by adding a field to the suite collection specifying the validity period. The workflow that decides which indexes need to be regenerated would additionally include all those whose validity period is nearly up in its search criteria.
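A minimal sketch of the Valid-Until idea follows, assuming a hypothetical per-suite validity period and a hypothetical safety margin that defines "nearly up". Neither exists in the current data model.

```python
from datetime import datetime, timedelta, timezone


def valid_until(generated_at: datetime, validity: timedelta) -> datetime:
    """Return the moment at which indexes generated at `generated_at` expire."""
    return generated_at + validity


def nearly_expired(generated_at, validity, now, margin):
    """Return True if the indexes expire within `margin` of `now`."""
    return valid_until(generated_at, validity) - now <= margin


generated = datetime(2024, 1, 1, tzinfo=timezone.utc)
week = timedelta(days=7)
expires = valid_until(generated, week)
```

A workflow could then include any suite for which nearly_expired returns True in its search criteria, alongside the suites whose contents have changed.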
Debian uses Extra-Source-Only: yes to indicate that a source package is only present in an index due to being referenced by a binary package in the suite (via Built-Using or Source). Debusine has all the necessary information about which source and binary packages are in the suite and how they relate to each other, so it can add this field when generating Sources files. (We may find that checking the relationships efficiently requires some additional database indexes.)