Debusine concepts

Artifacts

Artifacts are at the heart of Debusine. They are both inputs (submitted by users) and outputs (generated by tasks). An artifact combines:

  • an arbitrary set of files

  • arbitrary key-value data (stored as a JSON-encoded dictionary)

  • a category

The category is just a string identifier used to recognize artifacts sharing the same structure. You can create and use categories as you see fit, but we have defined a basic ontology suited to a Debian-based distribution.
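
For illustration, a source package artifact could be sketched roughly like this in Python; the category shown is taken from the ontology, while the field names and package data are simplified and do not reflect the exact API schema:

    # Illustrative sketch only; not the exact API schema.
    artifact = {
        # string identifier describing the structure (see the ontology)
        "category": "debian:source-package",
        # arbitrary key-value data, stored as a JSON-encoded dictionary
        "data": {"name": "hello", "version": "2.10-3"},
        # arbitrary set of files attached to the artifact
        "files": [
            "hello_2.10-3.dsc",
            "hello_2.10.orig.tar.gz",
            "hello_2.10-3.debian.tar.xz",
        ],
    }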

Artifacts can have relations with other artifacts:

  • built-using: indicates that the build of the artifact used the target artifact (ex: “binary-packages” artifacts are built using “source-package” artifacts)

  • extends: indicates that the artifact is extending the target artifact in some way (ex: a “source-upload” artifact extends a “source-package” artifact with target distribution information)

  • relates-to: indicates that the artifact relates to another one in some way (ex: a “binary-upload” artifact relates-to a “binary-package”, or a “package-build-log” artifact relates to a “binary-package”).

Artifacts are not deleted:

  • as long as they are referenced by another artifact (through one of the above relationships)

  • as long as their expiration date has not passed

  • as long as they have not been manually deleted (if they don’t have any expiration date)

  • as long as they are referenced by items of a collection

Artifacts can have additional properties:

  • immutable: when set to True, nothing can be changed in the artifact through the API

  • creation timestamp: timestamp indicating when the artifact was created

  • last updated timestamp: timestamp indicating when the artifact was last modified/updated

The following operations are possible on artifacts:

  • create a new artifact

  • upload the content of one of its files

  • set key-value data

  • attach/remove a file

  • add/remove a relationship

  • delete an artifact

Files in artifacts are content-addressed (stored by hash) in the database, so a single file can be referenced in multiple places without unnecessary data duplication.
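
The following minimal sketch illustrates the idea of content addressing, assuming a SHA-256 digest is used as the storage key (the real storage layer is more involved):

    import hashlib

    # Sketch only: assume the content address is a SHA-256 digest of the file
    # contents.
    def content_address(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    store: dict[str, bytes] = {}

    def add_file(data: bytes) -> str:
        key = content_address(data)
        store.setdefault(key, data)  # identical content is stored only once
        return key

    # Two artifacts attaching the same file end up referencing a single copy.
    assert add_file(b"same content") == add_file(b"same content")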

Collections

A Collection is a set of artifacts or other collections that are intended to be used together. The following are some example use cases:

  • A suite in the Debian archive (e.g. “Debian bookworm”)

  • A Debian archive (a.k.a. repository) containing multiple suites

  • For a source package name, the latest version in each suite in Debian (compare https://tracker.debian.org/pkg/foo)

  • Results of a QA scan across all packages in unstable and experimental

  • Buildd-suitable debian:system-tarball artifacts for all Debian suites

  • Extracted .desktop files for each package name in a suite

Todo

Another possible idea is to use collections for the output of each task, either automatically or via a parameter to the task.

Collections have the following properties:

  • category: a string identifier indicating the structure of additional data; see the ontology

  • name: the name of the collection

  • workspace: defines access control and file storage for this collection; at present, all artifacts in the collection must be in the same workspace

  • full_history_retention_period, metadata_only_retention_period: optional time intervals to configure the retention of items in the collection after removal; see Retention of collection items for details

Collections are unique by category and name. They may be looked up by category and name, providing starting points for further lookups within collections.

Each item in a collection is a combination of some metadata and an optional reference to an artifact or another collection. The permitted categories for the artifact or collection are limited depending on the category of the containing collection. The metadata is as follows:

  • category: the category of the artifact, copied for ease of lookup and to preserve history

  • name: a name identifying the item, which will normally be derived automatically from some of its properties; only one item with a given name and an unset removal timestamp (i.e. an active item) may exist in any given collection

  • key-value data indicating additional properties of the item in the collection, stored as a JSON-encoded dictionary with a structure depending on the category of the collection; this data can:

    • provide additional data related to the item itself

    • provide additional data related to the associated artifact in the context of the collection (e.g. overrides for packages in suites)

    • override some artifact metadata in the context of the collection (e.g. vendor/codename of system tarballs)

    • duplicate some artifact metadata, to make querying easier and to preserve it as history even after the associated artifact has been expired (e.g. architecture of system tarballs)

  • audit log fields for changes in the item’s state:

    • timestamp (created_at), user (created_by_user), and workflow (created_by_workflow) for when it was created

    • timestamp (removed_at), user (removed_by_user), and workflow (removed_by_workflow) for when it was removed

This metadata may be retained even after a linked artifact has been expired (see Retention of collection items). This means that it is sometimes useful to design collection items to copy some basic information, such as package names and versions, from their linked artifacts for use when inspecting history.
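
As a rough illustration, a collection item could be sketched like this; field names are simplified and do not reflect the exact database schema:

    # Illustrative sketch of a collection item; field names are simplified.
    item = {
        "category": "debian:source-package",  # copied from the artifact
        "name": "hello_2.10-3",               # unique among active items
        "data": {
            # duplicated metadata, kept for queries and for history
            "package": "hello",
            "version": "2.10-3",
        },
        "artifact_id": 1234,                  # cleared once full history expires
        "created_at": "2024-01-01T00:00:00Z",
        "created_by_user": "alice",
        "created_by_workflow": None,
        "removed_at": None,                   # None means the item is active
        "removed_by_user": None,
        "removed_by_workflow": None,
    }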

The same artifact or collection may be present more than once in the same containing collection, with different properties. For example, this is useful when debusine needs to use the same artifact in more than one similar situation, such as a single system tarball that should be used for builds for more than one suite.

A collection may impose additional constraints on the items it contains, depending on its category. Some constraints may apply only to active items, while some may apply to all items. If a collection contains another collection, all relevant constraints are applied recursively.

Collections can be compared: for example, a collection of outputs of QA tasks can be compared with the collection of inputs to those tasks, making it easy to see which new tasks need to be scheduled to stay up to date.
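
A minimal sketch of such a comparison, assuming each collection is reduced to a mapping from active item names to their data:

    # Each collection is reduced to a mapping from active item name to its data.
    def missing_analyses(inputs: dict, outputs: dict) -> set:
        """Return names of input items that have no corresponding output yet."""
        return set(inputs) - set(outputs)

    suite = {"hello_2.10-3": {}, "bash_5.2-1": {}}
    lintian_results = {"hello_2.10-3": {}}
    print(missing_analyses(suite, lintian_results))  # {'bash_5.2-1'}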

Retention of collection items

Collection items and the artifacts they refer to may be retained in debusine’s database for some time after the item is removed from the collection, depending on the values of full_history_retention_period and metadata_only_retention_period. The sequence of events is as follows:

  • item is removed from collection: metadata and artifact are both still present

  • after full_history_retention_period, the link between the collection item and the artifact is removed: metadata is still present, but the artifact may be expired if nothing else prevents that from happening

  • after full_history_retention_period + metadata_only_retention_period, the collection item itself is deleted from the database: metadata is no longer present, so the history of the collection no longer records that the item in question was ever in the collection

If full_history_retention_period is not set, then artifacts in the collection and the files they contain are never expired. If metadata_only_retention_period is not set, then metadata-level history of items in the collection is never expired.
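
The resulting timeline can be sketched as follows, assuming both retention periods are configured (the values are hypothetical):

    from datetime import datetime, timedelta, timezone

    # Hypothetical values for a removed collection item.
    removed_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    full_history_retention_period = timedelta(days=30)
    metadata_only_retention_period = timedelta(days=335)

    # After this point the artifact link is dropped and the artifact may expire.
    artifact_link_removed_after = removed_at + full_history_retention_period
    # After this point the collection item itself is deleted from the database.
    item_deleted_after = artifact_link_removed_after + metadata_only_retention_period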

Updating collections

The purpose of some tasks is to update a collection. Those tasks must ensure that anything else looking at the collection always sees a consistent state, satisfying whatever invariants are defined for that collection. In most cases it is sufficient to ensure that the task does all its updates within a single database transaction. This may be impractical for some long-running tasks, and they might need to break up the updates into chunks instead; in such cases they must still be careful that the state of the collection at each transaction boundary is consistent.
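
A minimal sketch of the single-transaction approach, assuming hypothetical Django models in which a collection exposes its items through a reverse relation named items:

    from django.db import transaction
    from django.utils import timezone

    # Sketch only: the model names and fields are hypothetical.
    def replace_items(collection, new_items):
        with transaction.atomic():
            # Readers see the collection either before or after the update,
            # never a half-applied state.
            collection.items.filter(removed_at__isnull=True).update(
                removed_at=timezone.now()
            )
            for item in new_items:
                collection.items.create(**item)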

To support automated QA at the scale of a distribution, some collections are derived automatically from other collections, and there are special arrangements for keeping those collections up to date. See Derived collections.

Workspaces

A Workspace is a concept tying together a set of Artifacts and a set of Users. Since Artifacts have to be stored somewhere, Workspaces also tie together a set of FileStores where files can be stored.

Workspaces have the following important properties:

  • public: a boolean which indicates whether the Artifacts are publicly accessible or if they are restricted to the users belonging to the workspace

  • default_expiration_delay: the minimum time (in days) that a new artifact is kept in the workspace before being expired (see the sketch after this list). This value can be overridden on individual artifacts afterwards. If this value is 0, then Artifacts are never expired until they are manually removed.

  • default_file_store: the default FileStore where newly uploaded files are stored.
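
As a sketch of the expiration rule for default_expiration_delay, assuming the delay is handled as a plain number of days:

    from datetime import datetime, timedelta
    from typing import Optional

    # Sketch only: 0 means "never expires until manually removed".
    def expiration_date(
        created_at: datetime, default_expiration_delay: int
    ) -> Optional[datetime]:
        if default_expiration_delay == 0:
            return None
        return created_at + timedelta(days=default_expiration_delay)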

Tasks

Tasks are time-consuming operations that are typically offloaded to dedicated workers. They consume artifacts as input and generate artifacts as output. The generated artifacts automatically have built-using relationships linking to the artifacts used as input.

Tasks can require specific features from the workers on which they will run (a matching sketch follows the list below). This will be used to ensure things like:

  • architecture selection (when managing builders on different architectures)

  • required memory amount

  • required free disk space amount

  • availability of a specific build chroot
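
As a rough illustration, such requirements could be matched against a worker's advertised features as follows; the field names are hypothetical:

    # Hypothetical field names; illustrative only.
    task_requirements = {"architecture": "arm64", "memory_mb": 4096, "disk_free_mb": 20480}
    worker_features = {"architecture": "arm64", "memory_mb": 8192, "disk_free_mb": 102400}

    def can_run(requirements: dict, features: dict) -> bool:
        if requirements["architecture"] != features["architecture"]:
            return False
        # Numeric requirements must be covered by the worker's resources.
        return all(
            features[key] >= value
            for key, value in requirements.items()
            if key != "architecture"
        )

    print(can_run(task_requirements, worker_features))  # True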

Each category of task specifies whether it should run on a debusine-worker instance or on a shared server-side Celery worker. The latter must be used only for tasks that do not execute any user-supplied code, and it provides direct access to the Debusine database.

Tasks that run on debusine-worker instances are required to use the public API to interact with artifacts. They are passed a dedicated token that has the proper permissions to retrieve the required artifacts and to upload the generated artifacts.

Executor Backends

Debusine supports multiple virtualisation backends for executing tasks, from lightweight containers (e.g. unshare) to VMs (e.g. incus-vm).

When tasks are executed in an executor backend, one of the task inputs is an environment, an artifact containing a system image that the task is executed in. These image artifacts are downloaded by the worker and cached locally. For some backends (e.g. Incus) they’ll be converted and/or imported into an image store.

The worker maintains an LRU cache of up to 10 images. When images are cleaned up, they are also removed from any relevant image stores.
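
A minimal sketch of such an LRU cache, capped at 10 images; the real worker code also has to delete the cached files and clean up backend image stores:

    from collections import OrderedDict

    # Sketch only: eviction side effects (file deletion, image store cleanup)
    # are omitted.
    class ImageCache:
        def __init__(self, max_images: int = 10) -> None:
            self.max_images = max_images
            self._images: OrderedDict[int, str] = OrderedDict()  # artifact id -> local path

        def get(self, artifact_id: int) -> str | None:
            if artifact_id in self._images:
                self._images.move_to_end(artifact_id)  # mark as most recently used
            return self._images.get(artifact_id)

        def add(self, artifact_id: int, path: str) -> None:
            self._images[artifact_id] = path
            self._images.move_to_end(artifact_id)
            while len(self._images) > self.max_images:
                # Least recently used image is evicted first.
                self._images.popitem(last=False)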

Work Requests

Work Requests are the way Debusine schedules tasks to workers and monitors their progress and success.

Work Requests have the following important properties:

  • task_name: the name of the task to execute (used to figure out the Python class implementing the logic)

  • task_data: a JSON dict representing the input parameters for the task

  • status: the processing status of the work request. Allowed values are:

    • blocked: the task is not ready to be executed

    • pending: the task is ready to be executed and can be picked up by a worker

    • running: the task is currently being executed by a worker

    • aborted: the task has been cancelled/aborted

    • completed: the task has been completed

  • result: the processing result. Allowed values are:

    • success: the task completed and succeeded

    • failure: the task completed and failed

    • error: an unexpected error happened during execution

  • workspace: foreign key to the workspace where the task is executed

  • worker: a foreign key to the assigned worker (NULL while the work request is pending or blocked)

  • unblock_strategy: a field specifying how the work request can move from blocked to pending status. Supported values are:

    • deps: the work request can be unblocked once all the work requests it depends on have completed

    • manual: the work request must be manually unblocked

  • dependencies: ManyToMany relation with other WorkRequest that need to complete before this one can be unblocked (if using the deps unblock_strategy)

  • parent: foreign key to the containing WorkRequest (or NULL when scheduled outside of a workflow). The parent hierarchy eventually reaches a node of type workflow, which is the node that manages this WorkRequest hierarchy. See Workflows.

  • workflow_data: JSON dict controlling some workflow specific behaviour

  • event_reactions: JSON dict describing actions to perform in response to specific events.

Blocked work requests using the deps unblock strategy may have dependencies on other work requests. Those dependencies are only used to control the order of execution of work requests inside workflows: the scheduler ignores blocked work requests and only considers pending ones. The deps unblock strategy changes the status of the work request to pending once all the work requests it depends on have completed.
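
A minimal sketch of the deps unblock strategy, using plain dictionaries in place of the real WorkRequest model:

    # Sketch only: real WorkRequests are database rows, not dictionaries.
    work_requests = {
        1: {"status": "completed", "result": "success", "dependencies": []},
        2: {"status": "completed", "result": "success", "dependencies": []},
        3: {"status": "blocked", "unblock_strategy": "deps", "dependencies": [1, 2]},
    }

    def maybe_unblock(work_request_id: int) -> None:
        work_request = work_requests[work_request_id]
        if work_request.get("unblock_strategy") != "deps":
            return
        if all(
            work_requests[dep]["status"] == "completed"
            for dep in work_request["dependencies"]
        ):
            work_request["status"] = "pending"  # now visible to the scheduler

    maybe_unblock(3)
    print(work_requests[3]["status"])  # pending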

Some work requests run on a Celery worker with direct access to the Debusine database, rather than on a less-privileged external worker.

Workflows

Workflows are advanced server-side logic that can schedule and combine server tasks and worker tasks: outputs of some work requests can become the input of other work requests, and the flow of execution can be influenced by the results of already executed work requests.

Workflows are powerful operations, in particular due to their ability to run server tasks. Until finer-grained access control is implemented, users can only start the subset of workflows that have been made available by the workspace administrator (by creating workflow templates). This process:

  • grants a unique name to the workflow so that it can be easily identified and started by users

  • defines all the input parameters that cannot be overridden when a user starts the workflow

Those workflow templates can then be turned into actual running workflows by users or external events, through the web interface or through the API.

The input parameters that are not set in the workflow template are called run-time parameters and they have to be provided by the user that starts the workflow. Those parameters are stored in a WorkRequest model with task_type workflow that will be used as the root of a WorkRequest hierarchy covering the whole duration of the process controlled by the workflow.
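
As an illustration, starting a workflow from a template could be sketched as follows; the template structure and field names are hypothetical:

    # Hypothetical template structure; the real model differs.
    template = {
        "name": "build-bookworm",
        "task_name": "sbuild",
        "fixed_parameters": {"target_distribution": "debian:bookworm"},
    }

    def start_workflow(template: dict, runtime_parameters: dict) -> dict:
        # Parameters fixed in the template cannot be overridden by the user.
        task_data = {**runtime_parameters, **template["fixed_parameters"]}
        return {
            "task_type": "workflow",  # root of the WorkRequest hierarchy
            "task_name": template["task_name"],
            "task_data": task_data,
        }

    root = start_workflow(template, {"source_artifact": 1234})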

Once completed, workflow instances have a remaining lifetime controlled by their expiration date and the expiration of some associated artifacts.

To begin with, available workflows will be limited to those that are fully implemented in Debusine. In the future, we expect to add a more flexible approach where administrators can submit fully customized logic combining various building blocks.

Here are some examples of possible workflows:

  • Package build: it would take a source package and a target distribution as input parameters, and the workflow would automate the following steps: { sbuild on all architectures supported in the target distribution } → add source and binary packages to target distribution.

    See sbuild workflow.

  • Package review: it would take a source package and associated binary packages and a target distribution, and the workflow would control the following steps: { generating debdiff between source packages, lintian, autopkgtest, autopkgtests of reverse-dependencies } → manual validation by reviewer → add source and binary packages to target distribution.

  • Both build and review could be combined in a larger workflow.

    In that case, the reverse-dependencies whose autopkgtests should be run cannot be identified until the sbuild task has completed, so the workflow would be expanded/reconfigured after that step completed.

  • Update a collection of lintian analyses of the latest packages in a given distribution based on the changes of the collection representing that distribution.

    Here again the set of lintian analyses to run depends on a first step of comparison between the two collections.

See Workflows for a list of available workflows.