Debusine concepts

Artifacts

Artifacts are at the heart of Debusine. They are both inputs (submitted by users) and outputs (generated by tasks). An artifact combines:

  • an arbitrary set of files

  • arbitrary key-value data (stored as a JSON-encoded dictionary)

  • a category

The category is just a string identifier used to recognize artifacts sharing the same structure. You can create and use categories as you see fit, but we have defined a basic ontology suited to a Debian-based distribution.
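
For illustration, a source package artifact could be sketched roughly like this in Python; the category shown is taken from the ontology, while the field names and package data are simplified and do not reflect the exact API schema:

    # Illustrative sketch only; not the exact API schema.
    artifact = {
        # string identifier describing the structure (see the ontology)
        "category": "debian:source-package",
        # arbitrary key-value data, stored as a JSON-encoded dictionary
        "data": {"name": "hello", "version": "2.10-3"},
        # arbitrary set of files attached to the artifact
        "files": [
            "hello_2.10-3.dsc",
            "hello_2.10.orig.tar.gz",
            "hello_2.10-3.debian.tar.xz",
        ],
    }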

Artifacts can have relations with other artifacts:

  • built-using: indicates that the build of the artifact used the target artifact (ex: “binary-packages” artifacts are built using “source-package” artifacts)

  • extends: indicates that the artifact is extending the target artifact in some way (ex: a “source-upload” artifact extends a “source-package” artifact with target distribution information)

  • relates-to: indicates that the artifact relates to another one in some way (ex: a “binary-upload” artifact relates-to a “binary-package”, or a “package-build-log” artifact relates to a “binary-package”).

Artifacts are not deleted:

  • as long as they are referenced by another artifact (through one of the above relationships)

  • as long as their expiration date has not passed

  • as long as they have not been manually deleted (if they don’t have any expiration date)

  • as long as they are referenced by items of a collection

Artifacts can have additional properties:

  • immutable: when set to True, nothing can be changed in the artifact through the API

  • creation timestamp: timestamp indicating when the artifact was created

  • last updated timestamp: timestamp indicating when the artifact was last modified/updated

The following operations are possible on artifacts:

  • create a new artifact

  • upload the content of one of its files

  • set key-value data

  • attach/remove a file

  • add/remove a relationship

  • delete an artifact

Files in artifacts are content-addressed (stored by hash) in the database, so a single file can be referenced in multiple places without unnecessary data duplication.
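
The following minimal sketch illustrates the idea of content addressing, assuming a SHA-256 digest is used as the storage key (the real storage layer is more involved):

    import hashlib

    # Sketch only: assume the content address is a SHA-256 digest of the file
    # contents.
    def content_address(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    store: dict[str, bytes] = {}

    def add_file(data: bytes) -> str:
        key = content_address(data)
        store.setdefault(key, data)  # identical content is stored only once
        return key

    # Two artifacts attaching the same file end up referencing a single copy.
    assert add_file(b"same content") == add_file(b"same content")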

Collections

A Collection is a set of artifacts or other collections that are intended to be used together. The following are some example use cases:

  • A suite in the Debian archive (e.g. “Debian bookworm”)

  • A Debian archive (a.k.a. repository) containing multiple suites

  • For a source package name, the latest version in each suite in Debian (compare https://tracker.debian.org/pkg/foo)

  • Results of a QA scan across all packages in unstable and experimental

  • Buildd-suitable debian:system-tarball artifacts for all Debian suites

  • Extracted .desktop files for each package name in a suite

Todo

Another possible idea is to use collections for the output of each task, either automatically or via a parameter to the task.

Collections have the following properties:

  • category: a string identifier indicating the structure of additional data; see the ontology

  • name: the name of the collection

  • workspace: defines access control and file storage for this collection; at present, all artifacts in the collection must be in the same workspace

  • full_history_retention_period, metadata_only_retention_period: optional time intervals to configure the retention of items in the collection after removal; see Retention of collection items for details

Collections are unique by category and name. They may be looked up by category and name, providing starting points for further lookups within collections.

Each item in a collection is a combination of some metadata and an optional reference to an artifact or another collection. The permitted categories for the artifact or collection are limited depending on the category of the containing collection. The metadata is as follows:

  • category: the category of the artifact, copied for ease of lookup and to preserve history

  • name: a name identifying the item, which will normally be derived automatically from some of its properties; only one item with a given name and an unset removal timestamp (i.e. an active item) may exist in any given collection

  • key-value data indicating additional properties of the item in the collection, stored as a JSON-encoded dictionary with a structure depending on the category of the collection; this data can:

    • provide additional data related to the item itself

    • provide additional data related to the associated artifact in the context of the collection (e.g. overrides for packages in suites)

    • override some artifact metadata in the context of the collection (e.g. vendor/codename of system tarballs)

    • duplicate some artifact metadata, to make querying easier and to preserve it as history even after the associated artifact has been expired (e.g. architecture of system tarballs)

  • audit log fields for changes in the item’s state:

    • timestamp (created_at), user (created_by_user), and workflow (created_by_workflow) for when it was created

    • timestamp (removed_at), user (removed_by_user), and workflow (removed_by_workflow) for when it was removed

This metadata may be retained even after a linked artifact has been expired (see Retention of collection items). This means that it is sometimes useful to design collection items to copy some basic information, such as package names and versions, from their linked artifacts for use when inspecting history.
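
As a rough illustration, a collection item could be sketched like this; field names are simplified and do not reflect the exact database schema:

    # Illustrative sketch of a collection item; field names are simplified.
    item = {
        "category": "debian:source-package",  # copied from the artifact
        "name": "hello_2.10-3",               # unique among active items
        "data": {
            # duplicated metadata, kept for queries and for history
            "package": "hello",
            "version": "2.10-3",
        },
        "artifact_id": 1234,                  # cleared once full history expires
        "created_at": "2024-01-01T00:00:00Z",
        "created_by_user": "alice",
        "created_by_workflow": None,
        "removed_at": None,                   # None means the item is active
        "removed_by_user": None,
        "removed_by_workflow": None,
    }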

The same artifact or collection may be present more than once in the same containing collection, with different properties. For example, this is useful when debusine needs to use the same artifact in more than one similar situation, such as a single system tarball that should be used for builds for more than one suite.

A collection may impose additional constraints on the items it contains, depending on its category. Some constraints may apply only to active items, while some may apply to all items. If a collection contains another collection, all relevant constraints are applied recursively.

Collections can be compared: for example, a collection of outputs of QA tasks can be compared with the collection of inputs to those tasks, making it easy to see which new tasks need to be scheduled to stay up to date.
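
A minimal sketch of such a comparison, assuming each collection is reduced to a mapping from active item names to their data:

    # Each collection is reduced to a mapping from active item name to its data.
    def missing_analyses(inputs: dict, outputs: dict) -> set:
        """Return names of input items that have no corresponding output yet."""
        return set(inputs) - set(outputs)

    suite = {"hello_2.10-3": {}, "bash_5.2-1": {}}
    lintian_results = {"hello_2.10-3": {}}
    print(missing_analyses(suite, lintian_results))  # {'bash_5.2-1'}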

Retention of collection items

Collection items and the artifacts they refer to may be retained in debusine’s database for some time after the item is removed from the collection, depending on the values of full_history_retention_period and metadata_only_retention_period. The sequence of events is as follows:

  • item is removed from collection: metadata and artifact are both still present

  • after full_history_retention_period, the link between the collection item and the artifact is removed: metadata is still present, but the artifact may be expired if nothing else prevents that from happening

  • after full_history_retention_period + metadata_only_retention_period, the collection item itself is deleted from the database: metadata is no longer present, so the history of the collection no longer records that the item in question was ever in the collection

If full_history_retention_period is not set, then artifacts in the collection and the files they contain are never expired. If metadata_only_retention_period is not set, then metadata-level history of items in the collection is never expired.
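
The resulting timeline can be sketched as follows, assuming both retention periods are configured (the values are hypothetical):

    from datetime import datetime, timedelta, timezone

    # Hypothetical values for a removed collection item.
    removed_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    full_history_retention_period = timedelta(days=30)
    metadata_only_retention_period = timedelta(days=335)

    # After this point the artifact link is dropped and the artifact may expire.
    artifact_link_removed_after = removed_at + full_history_retention_period
    # After this point the collection item itself is deleted from the database.
    item_deleted_after = artifact_link_removed_after + metadata_only_retention_period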

Updating collections

The purpose of some tasks is to update a collection. Those tasks must ensure that anything else looking at the collection always sees a consistent state, satisfying whatever invariants are defined for that collection. In most cases it is sufficient to ensure that the task does all its updates within a single database transaction. This may be impractical for some long-running tasks, and they might need to break up the updates into chunks instead; in such cases they must still be careful that the state of the collection at each transaction boundary is consistent.
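
A minimal sketch of the single-transaction approach, assuming hypothetical Django models in which a collection exposes its items through a reverse relation named items:

    from django.db import transaction
    from django.utils import timezone

    # Sketch only: the model names and fields are hypothetical.
    def replace_items(collection, new_items):
        with transaction.atomic():
            # Readers see the collection either before or after the update,
            # never a half-applied state.
            collection.items.filter(removed_at__isnull=True).update(
                removed_at=timezone.now()
            )
            for item in new_items:
                collection.items.create(**item)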

To support automated QA at the scale of a distribution, some collections are derived automatically from other collections, and there are special arrangements for keeping those collections up to date. See Derived collections.

Workspaces

A Workspace is a concept tying together a set of Artifacts and a set of Users. Since Artifacts have to be stored somewhere, Workspaces also tie together a set of FileStores where files can be stored.

Workspaces have the following important properties:

  • public: a boolean which indicates whether the Artifacts are publicly accessible or if they are restricted to the users belonging to the workspace

  • default_expiration_delay: the minimum time (in days) that a new artifact is kept in the workspace before being expired (see the sketch after this list). This value can be overridden on individual artifacts afterwards. If this value is 0, then Artifacts are never expired until they are manually removed.

  • default_file_store: the default FileStore where newly uploaded files are stored.
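
As a sketch of the expiration rule for default_expiration_delay, assuming the delay is handled as a plain number of days:

    from datetime import datetime, timedelta
    from typing import Optional

    # Sketch only: 0 means "never expires until manually removed".
    def expiration_date(
        created_at: datetime, default_expiration_delay: int
    ) -> Optional[datetime]:
        if default_expiration_delay == 0:
            return None
        return created_at + timedelta(days=default_expiration_delay)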

Tasks

Tasks are time-consuming operations that are typically offloaded to dedicated workers. They consume artifacts as input and generate artifacts as output. The generated artifacts automatically have built-using relationships linking to the artifacts used as input.

Tasks can require specific features from the workers on which they will run (a matching sketch follows the list below). This will be used to ensure things like:

  • architecture selection (when managing builders on different architectures)

  • required memory amount

  • required free disk space amount

  • availability of a specific build chroot
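
As a rough illustration, such requirements could be matched against a worker's advertised features as follows; the field names are hypothetical:

    # Hypothetical field names; illustrative only.
    task_requirements = {"architecture": "arm64", "memory_mb": 4096, "disk_free_mb": 20480}
    worker_features = {"architecture": "arm64", "memory_mb": 8192, "disk_free_mb": 102400}

    def can_run(requirements: dict, features: dict) -> bool:
        if requirements["architecture"] != features["architecture"]:
            return False
        # Numeric requirements must be covered by the worker's resources.
        return all(
            features[key] >= value
            for key, value in requirements.items()
            if key != "architecture"
        )

    print(can_run(task_requirements, worker_features))  # True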

Each category of task specifies whether it should run on a debusine-worker instance or on a shared server-side Celery worker. The latter must be used only for tasks that do not execute any user-supplied code, and it provides direct access to the Debusine database.

Tasks that run on debusine-worker instances are required to use the public API to interact with artifacts. They are passed a dedicated token that has the proper permissions to retrieve the required artifacts and to upload the generated artifacts.

Executor Backends

Debusine supports multiple virtualisation backends for executing tasks, from lightweight containers (e.g. unshare) to VMs (e.g. incus-vm).

When tasks are executed in an executor backend, one of the task inputs is an environment, an artifact containing a system image that the task is executed in. These image artifacts are downloaded by the worker and cached locally. For some backends (e.g. Incus) they’ll be converted and/or imported into an image store.

The worker maintains an LRU cache of up to 10 images. When images are cleaned up, they are also removed from any relevant image stores.
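
A minimal sketch of such an LRU cache, capped at 10 images; the real worker code also has to delete the cached files and clean up backend image stores:

    from collections import OrderedDict

    # Sketch only: eviction side effects (file deletion, image store cleanup)
    # are omitted.
    class ImageCache:
        def __init__(self, max_images: int = 10) -> None:
            self.max_images = max_images
            self._images: OrderedDict[int, str] = OrderedDict()  # artifact id -> local path

        def get(self, artifact_id: int) -> str | None:
            if artifact_id in self._images:
                self._images.move_to_end(artifact_id)  # mark as most recently used
            return self._images.get(artifact_id)

        def add(self, artifact_id: int, path: str) -> None:
            self._images[artifact_id] = path
            self._images.move_to_end(artifact_id)
            while len(self._images) > self.max_images:
                # Least recently used image is evicted first.
                self._images.popitem(last=False)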

Work Requests

Work Requests are the way Debusine schedules tasks to workers and monitors their progress and success.

Work Requests have the following important properties:

  • task_name: the name of the task to execute (used to figure out the Python class implementing the logic)

  • task_data: a JSON dict representing the input parameters for the task

  • status: the processing status of the work request. Allowed values are:

    • blocked: the task is not ready to be executed

    • pending: the task is ready to be executed and can be picked up by a worker

    • running: the task is currently being executed by a worker

    • aborted: the task has been cancelled/aborted

    • completed: the task has been completed

  • result: the processing result. Allowed values are:

    • success: the task completed and succeeded

    • failure: the task completed and failed

    • error: an unexpected error happened during execution

  • workspace: foreign key to the workspace where the task is executed

  • worker: a foreign key to the assigned worker (NULL while the work request is pending or blocked)

  • unblock_strategy: a field specifying how the work request can move from blocked to pending status. Supported values are:

    • deps: the work request can be unblocked once all the work requests it depends on have completed

    • manual: the work request must be manually unblocked

  • dependencies: ManyToMany relation with other WorkRequest that need to complete before this one can be unblocked (if using the deps unblock_strategy)

  • parent: foreign key to the containing WorkRequest (or NULL when scheduled outside of a workflow). The parent hierarchy eventually reaches a node of type workflow, which is the node that manages this WorkRequest hierarchy. See Workflows.

  • workflow_data: JSON dict controlling some workflow specific behaviour

  • event_reactions: JSON dict describing actions to perform in response to specific events.

Blocked work requests using the deps unblock strategy may have dependencies on other work requests. Those dependencies are only used to control the order of execution of work requests inside workflows: the scheduler ignores blocked work requests and only considers pending ones. The deps unblock strategy changes the status of the work request to pending once all the work requests it depends on have completed.
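
A minimal sketch of the deps unblock strategy, using plain dictionaries in place of the real WorkRequest model:

    # Sketch only: real WorkRequests are database rows, not dictionaries.
    work_requests = {
        1: {"status": "completed", "result": "success", "dependencies": []},
        2: {"status": "completed", "result": "success", "dependencies": []},
        3: {"status": "blocked", "unblock_strategy": "deps", "dependencies": [1, 2]},
    }

    def maybe_unblock(work_request_id: int) -> None:
        work_request = work_requests[work_request_id]
        if work_request.get("unblock_strategy") != "deps":
            return
        if all(
            work_requests[dep]["status"] == "completed"
            for dep in work_request["dependencies"]
        ):
            work_request["status"] = "pending"  # now visible to the scheduler

    maybe_unblock(3)
    print(work_requests[3]["status"])  # pending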

Some work requests run on a Celery worker with direct access to the Debusine database, rather than on a less-privileged external worker.

Workflows

Workflows are advanced server-side logic that can schedule and combine server tasks and worker tasks: outputs of some work requests can become the input of other work requests, and the flow of execution can be influenced by the results of already executed work requests.

Workflows are powerful operations, in particular due to their ability to run server tasks. Until finer-grained access control is implemented, users can only start the subset of workflows that have been made available by the workspace administrator (by creating workflow templates). This process:

  • grants a unique name to the workflow so that it can be easily identified and started by users

  • defines all the input parameters that cannot be overridden when a user starts the workflow

Those workflow templates can then be turned into actual running workflows by users or external events, through the web interface or through the API.

The input parameters that are not set in the workflow template are called run-time parameters and they have to be provided by the user that starts the workflow. Those parameters are stored in a WorkRequest model with task_type workflow that will be used as the root of a WorkRequest hierarchy covering the whole duration of the process controlled by the workflow.
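
As an illustration, starting a workflow from a template could be sketched as follows; the template structure and field names are hypothetical:

    # Hypothetical template structure; the real model differs.
    template = {
        "name": "build-bookworm",
        "task_name": "sbuild",
        "fixed_parameters": {"target_distribution": "debian:bookworm"},
    }

    def start_workflow(template: dict, runtime_parameters: dict) -> dict:
        # Parameters fixed in the template cannot be overridden by the user.
        task_data = {**runtime_parameters, **template["fixed_parameters"]}
        return {
            "task_type": "workflow",  # root of the WorkRequest hierarchy
            "task_name": template["task_name"],
            "task_data": task_data,
        }

    root = start_workflow(template, {"source_artifact": 1234})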

Once completed, workflow instances have a remaining lifetime controlled by their expiration date and the expiration of some associated artifacts.

To begin with, available workflows will be limited to those that are fully implemented in Debusine. In the future, we expect to add a more flexible approach where administrators can submit fully customized logic combining various building blocks.

Here are some examples of possible workflows:

  • Package build: it would take a source package and a target distribution as input parameters, and the workflow would automate the following steps: { sbuild on all architectures supported in the target distribution } → add source and binary packages to target distribution.

    See sbuild workflow.

  • Package review: it would take a source package and associated binary packages and a target distribution, and the workflow would control the following steps: { generating debdiff between source packages, lintian, autopkgtest, autopkgtests of reverse-dependencies } → manual validation by reviewer → add source and binary packages to target distribution.

  • Both build and review could be combined in a larger workflow.

    In that case, the reverse-dependencies whose autopkgtests should be run cannot be identified until the sbuild task has completed, so the workflow would be expanded/reconfigured after that step completed.

  • Update a collection of lintian analyses of the latest packages in a given distribution based on the changes of the collection representing that distribution.

    Here again the set of lintian analyses to run depends on a first step of comparison between the two collections.

See Workflows for a list of available workflows.