Contribute a new workflow

Workflows are Debusine’s most powerful feature, and we expect other people to have ideas of interesting workflows to try out. Eventually we want to provide some kind of domain-specific language to create custom workflows directly in the system, but in the meantime you are welcome to contribute your own workflows and we will help you as needed.

Design

All new workflows need a blueprint. This should at minimum detail what task data the workflow accepts and what child work requests it creates; see existing workflows for examples of how to lay this out. If your workflow involves any new tasks, then the blueprint should also include detailed descriptions of those in the same way.

It will save you time to discuss your plans with the development team so that we can advise on what approaches are likely to work most smoothly. You can do this either before or after writing a draft blueprint.

Task breakdown

Start by thinking about how your workflow breaks down into tasks. Some good questions to ask yourself are:

  • Can they be offloaded to external workers (most tasks should be in this category), or do they need direct access to the Debusine database (perhaps because they involve analysing a collection)?

  • How finely does the work need to be divided into individual tasks?

  • Are there any existing tasks that can help you solve parts of your problem?

  • What artifacts will need to be passed between tasks, and what dependencies will you need to have between them?

Input to the workflow

You should define the task data that your workflow will accept. This is usually related to the task data accepted by the individual tasks that it will create, but it may be higher-level and more opinionated. A good rule of thumb is that tasks express mechanism, while workflows express policy.

Workflow task data is constructed by combining workflow templates defined in a workspace with user input, so it should be laid out in a way that will allow workspace owners to use workflow templates to set useful rules for how contributors may start workflows. In particular, this means that while workflow task data can be arbitrary data structures, you probably want to lean towards a flat layout; top-level items not set in a workflow template may be provided by the user who starts the workflow.

Implementation

Each workflow has a Pydantic model in debusine/server/workflows/models.py for its task data, and a corresponding orchestrator class under debusine/server/workflows/. At minimum, that class should look like this:

class FooWorkflow(Workflow[FooWorkflowData, BaseDynamicTaskData]):
    """Do stuff."""

    TASK_NAME = "foo"

    def build_dynamic_data(
        self, task_database: TaskDatabaseInterface
    ) -> BaseDynamicTaskData:
        """Compute dynamic data for this workflow."""
        return BaseDynamicTaskData(subject=...)

    def populate(self) -> None:
        """Create work requests."""

    def get_label(self) -> str:
        """Return the task label."""
        return "..."

The build_dynamic_data method is typically less complex for workflows than for tasks. In the workflow case, it usually just needs to compute a reasonable subject for a workflow instance based on self.data (see Task configuration) and perhaps also parameter_summary describing the most important parameters to the workflow for display in the web UI.

The populate method does the bulk of the orchestrator’s work. Usually, it looks up any artifacts or collections needed from self.data, decides which child work requests it needs to create, and calls self.work_request_ensure_child to create them. It may also call self.provides_artifact and/or self.requires_artifact to indicate dependencies between work requests in the workflow; take care that the name passed to self.provides_artifact is unique across all internal collection items under the same root workflow.

If your workflow creates sub-workflows, then it is your workflow’s responsibility to run their orchestrators in turn. That normally looks like this:

def populate(self) -> None:
    wr = self.work_request_ensure_child(...)
    wr.mark_running()
    orchestrate_workflow(wr)

In some exceptional cases the sub-workflow’s orchestrator may not be able to work out which child work requests to create until a dependency has completed. In such cases, the populate method should instead have this sort of structure:

def populate(self) -> None:
    wr = self.work_request_ensure_child(task_type=TaskTypes.WORKFLOW, ...)
    self.requires_artifact(wr, ...)
    if wr.status == WorkRequest.Statuses.PENDING:
        wr.mark_running()
        orchestrate_workflow(wr)

The get_label method returns a string used as a label for a workflow instance in the web UI.

You may find it helpful to consult some existing implementations for inspiration. The lintian workflow (debusine/server/workflows/lintian.py) is a relatively simple example that demonstrates some of the points here.

Documentation

Once you have implemented your workflow, make sure to move the corresponding documentation from the blueprint into the main documentation, usually under docs/reference/workflows/specs/.