Debusine concepts
Debusine has been designed to run a network of generic “workers” that can perform various “tasks” producing “artifacts”. Interesting artifacts that we want to keep in the long term are stored in “collections”.
While tasks can be scheduled as individual “work requests”, the power of Debusine lies in its ability to combine multiple (different) tasks in “workflows”, where each workflow has its own logic to orchestrate multiple work requests across the available workers.
A Debusine instance can be multi-tenant, divided into “scopes” of users and groups. These contain “workspaces” that have their own sets of artifacts and collections. Workspaces can inherit from one another in order to share collections and artifacts when required.
Artifacts
Artifacts are at the heart of Debusine. They combine:
a set of files
key-value data (stored as a JSON-encoded dictionary)
a category
The category is just a string identifier used to recognize artifacts sharing the same structure. You can create and use categories as you see fit, but we have defined a basic ontology suited to the case of a Debian-based distribution.
Artifacts are both inputs (submitted by users) and outputs (generated by tasks). They are created and stored in a workspace and can have an expiration delay controlling their lifetime. Artifacts are (mostly) immutable: they should never be modified after creation.
Files in artifacts are content-addressed (stored by hash) in the database, so a single file can be referenced in multiple places without unnecessary data duplication.
Files in artifacts have names that may include directories.
Artifacts can have relations with other artifacts, see artifact relationships.
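To make this structure more concrete, here is a minimal illustrative sketch of an artifact; the class and field names are simplified for the example and do not mirror Debusine's actual data model.

    from dataclasses import dataclass, field

    @dataclass
    class Artifact:
        # Category identifying the structure, e.g. "debian:source-package"
        category: str
        # Key-value data stored as a JSON-encoded dictionary
        data: dict
        # File names (possibly including directories) mapped to content hashes
        files: dict = field(default_factory=dict)

    # A hypothetical source package artifact
    artifact = Artifact(
        category="debian:source-package",
        data={"name": "hello", "version": "2.10-3"},
        files={
            "hello_2.10-3.dsc": "sha256:b5bb9d80...",
            "hello_2.10-3.debian.tar.xz": "sha256:7d865e95...",
        },
    )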
Assets
Assets are typed holders of key-value data with strong permissions that encode how the data may be used. As with artifacts, the category is used to distinguish assets sharing the same structure and purpose. See the ontology of possible assets.
Assets are used to store credentials (with permissions dictating who can see/edit/use those credentials). They are also used to represent external objects with permissions for the various operations that are possible on those objects. One example of these objects is a signing key (managed by a signing worker, possibly stored in a hardware security module), and the associated permissions encode who is able to generate signatures with that key.
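As a purely illustrative sketch (the category, data fields and permission names below are invented for the example, not Debusine's exact schema), an asset pairs a category and key-value data with permissions describing who may use it:

    # Hypothetical sketch of an asset referencing a signing key; all names
    # are placeholders chosen for illustration.
    signing_key_asset = {
        "category": "debusine:signing-key",
        "data": {"purpose": "uefi", "fingerprint": "ABCD1234..."},
        "permissions": {
            "sign": ["kernel-team"],       # groups allowed to request signatures
            "edit": ["workspace-admins"],  # groups allowed to modify the asset
        },
    }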
Collections
A collection is an abstraction used to manage and store a coherent set of “collection items”. Each collection has a category field describing its intended use case, the allowed collection items, the associated metadata, etc. See the reference.
A collection is meant to represent things like this:
A suite in the Debian archive (e.g. “Debian bookworm”): the debian:suite collection is a collection of debian:source-package and debian:binary-package artifacts.
A Debian archive (a.k.a. repository) that contains multiple suites: the debian:archive collection is a collection of debian:suite collections
Build chroots for all Debian suites: the debian:environments collection stores debian:system-tarball artifacts for multiple Debian suites
The results of a lintian analysis or autopkgtest runs across all the packages in a target suite
Extracted .desktop files for each package name in a suite
To cover those various cases, each collection item consists of some arbitrary metadata and can optionally link to an artifact or to a collection. Hence we define three kinds of collection items:
artifact-based items: they link an artifact with some metadata
collection-based items: they link a collection with some metadata
bare-data items: they only store some metadata
Each collection item has its own “category” that defines the nature of the item. For artifact-based items and collection-based items, it duplicates the category of the linked artifact or collection. For bare-data items, it indirectly defines the structure to expect in the metadata.
A collection item also has a unique name within the collection, so that the collection can be seen as a big Python dictionary mapping names to artifacts, collections and arbitrary data.
Collections can be uniquely identified within a workspace by category and name, and can provide useful starting points for further lookups within collections.
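Continuing the dictionary analogy, a collection can be pictured as a mapping from unique item names to items. This is only a conceptual sketch: the item structure is simplified, and only the categories are taken from the Debian ontology mentioned above.

    # Conceptual sketch of a debian:suite collection seen as a dictionary
    # mapping unique item names to collection items.
    bookworm_suite = {
        "hello_2.10-3": {            # artifact-based item
            "category": "debian:source-package",
            "artifact_id": 1234,
            "data": {"component": "main"},
        },
        "hello_2.10-3_amd64": {      # artifact-based item
            "category": "debian:binary-package",
            "artifact_id": 1235,
            "data": {"component": "main", "architecture": "amd64"},
        },
    }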
To learn more about collections, you can read more details about their data model.
Tasks
Tasks are time-consuming operations that are typically offloaded to dedicated workers.
Debusine contains a library of tasks to perform various operations that are useful when you contribute to Debian or one of its derivatives (“build a package”, “run lintian”, “upload a package”, etc.).
The behaviour of each task can be controlled and customized with input parameters. The combination of a task and its actual input parameters constitutes a work request that can be scheduled to run.
There are six types of tasks, but the most interesting ones are the Worker, Server and Signing tasks.
Worker tasks
Worker tasks run on external workers, often within some controlled execution environments. They can only interact with Debusine through the public API. Hence they will typically only consume and produce artifacts, and create relationships between them.
Worker tasks can require specific features from the workers on which they will run. This is used to ensure that the assigned worker will have all the required resources for the task to succeed.
Signing tasks
Signing tasks are very much like worker tasks, except that they have access to a local database containing sensitive cryptographic material (i.e. private keys) that needs to be stored in a secure manner and whose access should be tightly controlled.
Server tasks
Server tasks perform operations that require direct database access and that may take some time to run. They run on Celery workers, and must not execute any user-controlled code.
Work requests
Work requests are the way Debusine schedules tasks to workers and monitors their progress and success. Basically, a work request ties together a task (i.e. some code to execute on a worker) and its parameters (values used to customize the behaviour of the task).
Note
There are different types of tasks, but they all share the same work request structure for the purpose of being scheduled. This includes workflows, thus much of what is said about work requests also applies to the concept of workflows, even if we present workflows separately from tasks due to their special role in Debusine.
Worker tasks and workflows are the two types of tasks that can be scheduled individually by Debusine users. All the other types of tasks are restricted and can only be started indirectly through one of the workflows available in the workspace.
A work request is tied to a workspace. This defines what the task has access to and where its output will be stored. The artifacts generated as output by the task are linked to the work request and can be easily reused.
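As a conceptual sketch only (the field and parameter names below are illustrative, not Debusine's exact schema), a work request pairs a task with its parameters in a given workspace:

    # Hypothetical shape of a work request for the sbuild worker task.
    work_request = {
        "task_type": "Worker",
        "task_name": "sbuild",
        "workspace": "example-workspace",
        "task_data": {
            "source_artifact": 1234,
            "host_architecture": "amd64",
        },
    }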
To learn more about work requests, you can read:
Work request scheduling for more explanations about how work requests are scheduled.
Work Requests for more information about the data model and all the special cases.
Workflows
Workflows bring advanced orchestration logic to Debusine: they combine multiple individual tasks in a meaningful way. A workflow can:
start multiple work requests
analyze their results to decide on the next steps
reuse the output of a work request as input for another work request
extract data from collections
feed data into collections
etc.
Here are some examples of possible workflows:
Package build: it would take a source package and a target distribution as input parameters, and the workflow would automate the following steps: { sbuild on all architectures supported in the target distribution } → add source and binary packages to target distribution.
See sbuild workflow.
Package review: it would take a source package and associated binary packages and a target distribution, and the workflow would control the following steps: { generating debdiff between source packages, lintian, autopkgtest, autopkgtests of reverse-dependencies } → manual validation by reviewer → add source and binary packages to target distribution.
Both build and review could be combined in a larger workflow.
In that case, the reverse-dependencies whose autopkgtests should be run cannot be identified until the sbuild task has completed, so the workflow would be expanded/reconfigured after that step completed.
Update a collection of lintian analyses of the latest packages in a given distribution based on the changes of the collection representing that distribution.
Here again the set of lintian analyses to run depends on a first step of comparison between the two collections.
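The orchestration logic itself lives in server-side code. The following self-contained sketch only illustrates the general shape of a build workflow's first expansion step; the function names and structure are invented for the example and are not Debusine's actual orchestration API.

    # Pseudo-sketch: create one sbuild work request per supported architecture.
    def expand_build_workflow(create_work_request, source_artifact, architectures):
        return [
            create_work_request("sbuild",
                                {"source": source_artifact, "architecture": arch})
            for arch in architectures
        ]

    # Example usage with a stub that just records the requested work requests
    created = []
    def record(task_name, parameters):
        request = {"task": task_name, "parameters": parameters}
        created.append(request)
        return request

    expand_build_workflow(record, source_artifact=1234,
                          architectures=["amd64", "arm64"])
    print(created)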
Terminology
We often use the term “workflow” in different contexts to refer to different things. Workflows are a special kind of Work Request, so we have the same distinction between a Task (the code) and a Work Request (a running instance of the code with specific parameters). Here are the terms that we use, in the context of workflows, to distinguish between them:
Workflow Implementation: the code implementing the orchestration logic. Each workflow implementation uses many input parameters. They are documented in the reference documentation.
Workflow Instance: it’s a Work Request associating a Workflow Implementation with a set of input parameters. It has its own lifecycle from creation up to completion.
Workflow Template: it’s really a shortcut for a Workflow Instance Template. It is a pre-configured workflow provided by the workspace administrator that can be turned into workflow instances by users. More on this below.
Workflow Template
Workflows are powerful operations, in particular because of their ability to run server tasks. For this reason, users cannot start arbitrary workflows; they can only start the subset of workflows that have been made available by the workspace administrator through workflow templates. A workflow template:
grants a unique name to a pre-configured workflow so that it can be easily identified and started by users
defines all the input parameters that cannot be overridden when a user starts the workflow
The input parameters that are not set in the workflow template are called run-time parameters and they have to be provided by the user that starts the workflow.
The resulting input parameters are stored in a WorkRequest model with task_type workflow that will be used as the root of a WorkRequest hierarchy covering the whole duration of the process controlled by the workflow. See Workflow orchestration to learn more about how child work requests are created.
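To illustrate the split between template-defined and run-time parameters, here is a conceptual sketch only; the parameter names are invented for the example.

    # The effective parameters of a workflow instance combine the template's
    # fixed parameters with the run-time parameters supplied by the user.
    template_parameters = {"target_distribution": "bookworm",
                           "architectures": ["amd64", "arm64"]}
    runtime_parameters = {"source_artifact": 1234}

    # Template parameters cannot be overridden, so they take precedence
    effective_parameters = {**runtime_parameters, **template_parameters}
    print(effective_parameters)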
File stores
Files in artifacts are stored in file stores. These are content-addressed: a file with a given SHA-256 digest is only stored once in any given store, and may be retrieved by that digest. When a new artifact is created, its files are uploaded to stores as needed. Some of the files may already be present; in that case, if the file is already part of the artifact’s workspace, it does not need to be reuploaded, but otherwise it must be uploaded again so that users cannot obtain unauthorized access to existing file contents simply by referencing a known hash.
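The following self-contained sketch shows the content-addressing idea with a plain dictionary standing in for a store; it is not Debusine's implementation.

    import hashlib

    # Minimal sketch of content-addressed storage: files are keyed by the
    # SHA-256 digest of their contents, so identical contents are stored once.
    store = {}

    def add_file(contents: bytes) -> str:
        digest = hashlib.sha256(contents).hexdigest()
        store.setdefault(digest, contents)  # a no-op if the digest is already known
        return digest

    first = add_file(b"identical contents")
    second = add_file(b"identical contents")
    assert first == second and len(store) == 1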
Local storage is useful as the initial destination for uploads to Debusine, but it has to be backed up manually and might not scale to sufficiently large volumes of data. Remote storage such as S3 is also available. It is possible to serve a file from any store, with policies for which one to prefer for downloads and uploads.
Administrators can set policies for which file stores to use at the scope level, as well as policies for populating and draining stores of files. Most bulk movement is handled by a periodic job.
To learn more about file stores, see their reference.
Scopes
Scopes are the foundational concept used to implement multi-tenancy in Debusine. They are an administrative grouping of users, groups and workspaces. They appear as the initial segment in the URL path of most web views.
Groups and workspaces can only exist in a single scope. Users are global and might be part of multiple scopes.
Since artifacts have to be stored somewhere, scopes also define the set of file stores where files can be stored.
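For example, the web views of a workspace typically live under a path of the form /&lt;scope-name&gt;/&lt;workspace-name&gt;/…, with the scope as the first path segment; the exact layout shown here is only illustrative.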
Workspaces
A workspace is an administrative concept hosting artifacts and collections. Users can get different levels of access to those artifacts and collections by being granted different roles on the workspace.
Workspaces have the following important properties:
public: a boolean which indicates whether the artifacts are publicly accessible or if they are restricted to the users belonging to the workspace
default_expiration_delay: the minimal duration that a new artifact is kept in the workspace before being expired. See Expiration of data.
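As a conceptual sketch only: with a default_expiration_delay of 30 days, an artifact created now is kept at least until the date computed below. The real rules, including per-artifact settings, are described in Expiration of data.

    from datetime import datetime, timedelta, timezone

    # The workspace default acts as a lower bound on how long a new artifact
    # is kept before it may be expired (simplified illustration).
    default_expiration_delay = timedelta(days=30)
    created_at = datetime.now(timezone.utc)
    earliest_expiry = created_at + default_expiration_delay
    print(earliest_expiry)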
To learn more about workspaces, see their reference.
Workers
Workers are services that run tasks on behalf of a Debusine server. There are three types of worker.
External workers
Most workers are external workers, running an instance of debusine-worker. This is a daemon that runs untrusted tasks using some form of containerization or virtualization. It has no direct access to the Debusine database; instead, it interacts with the server using the HTTP API and WebSockets.
External workers process one task at a time, and only process Worker tasks.
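Conceptually, an external worker repeatedly fetches an assigned work request over the API, executes the task in an isolated environment, and uploads the resulting artifacts. The sketch below only illustrates that interaction pattern; the function names are placeholders, not the actual debusine-worker implementation or API.

    # Conceptual sketch of an external worker's lifecycle.
    def worker_loop(fetch_work_request, run_in_sandbox, upload_artifacts):
        while True:
            work_request = fetch_work_request()    # assigned via the HTTP API/WebSockets
            if work_request is None:
                break                              # nothing left to do in this sketch
            result = run_in_sandbox(work_request)  # containerized/virtualized execution
            upload_artifacts(work_request, result) # results go back through the API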
To support spikes in work requests, Debusine is able to use Dynamic Worker Pools to host external workers in clouds. These are provisioned as required, and terminated when idle.
Celery workers
A Debusine instance normally has an associated Celery worker, which is used to run tasks that require direct access to the Debusine database. These tasks are necessarily trusted, so they must not involve running user-controlled code.
Celery workers have a concurrency level, normally set to the number of logical CPUs in the system (os.cpu_count()).
Todo
Document (and possibly fix) what happens when workers are restarted while running a task.
Signing workers
Signing workers work in a similar way to external workers, but they have access to private key material, either directly or via a hardware security module. They only process Signing tasks.