.. _deployment-architecture:

=======================
Deployment architecture
=======================

Debusine has the following major components that should be considered as
part of a deployment:

* `PostgreSQL <https://www.postgresql.org/>`__ database
* `Redis <https://redis.io/>`__ database
* :ref:`explanation-file-stores`
* Server
* Scheduler
* :ref:`explanation-workers`

PostgreSQL database
===================

Debusine stores most of its persistent data in PostgreSQL, including
artifact metadata, assets, collections, and work requests (see
:ref:`debusine-concepts`). Debusine also uses it as a `Celery result
backend
<https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-result-backend-settings>`__.

PostgreSQL may be deployed separately from other components, and may be
deployed in a primary/replica arrangement for higher availability (although
Debusine does not yet support routing read-only requests to different
members of the cluster, so running multiple instances does not yet provide
any performance advantage). The server, the server's Celery workers, and
the scheduler all need read-write access to the database.

The database schema is maintained as part of Debusine, and is upgraded
automatically by regular package upgrades using Django migrations.

The PostgreSQL database must be backed up, preferably using `continuous
archiving
<https://www.postgresql.org/docs/current/continuous-archiving.html>`__ to
allow for point-in-time recovery. The details are out of scope for this
document, but typical approaches include `WAL-G
<https://github.com/wal-g/wal-g>`__ and `pgBackRest
<https://pgbackrest.org/>`__.
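
As a sketch, continuous archiving with pgBackRest amounts to a
``postgresql.conf`` fragment like the following. The stanza name
``debusine`` and the pgBackRest setup itself are assumptions for
illustration, not Debusine defaults:

```ini
# postgresql.conf fragment: ship every completed WAL segment to the
# backup repository. The stanza name "debusine" is an assumed example;
# pgBackRest must already be configured with a matching stanza here.
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=debusine archive-push %p'
```

With archiving in place, periodic ``pgbackrest --stanza=debusine backup``
runs provide the base backups that point-in-time recovery replays from.
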

Redis database
==============

Debusine uses Redis as a `Celery broker
<https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html>`__:
this transports messages from the server or the scheduler to the Celery
workers, in order to execute server-side asynchronous tasks.

Redis also acts as a `channel layer
<https://channels.readthedocs.io/en/latest/topics/channel_layers.html>`__,
which is part of how Debusine communicates with its workers. For example,
when a work request is assigned to a worker or when a worker completes a
work request, Debusine sends notifications to other parts of itself via
Redis.

Redis may be deployed separately from other components. The server, the
server's Celery workers, and the scheduler all need access to it.

The Redis database does not need to be backed up. The worst case is that
some ephemeral messages are lost and that some tasks may need to be
retried.

File stores
===========

Files in :ref:`artifacts ` may be large, and so they
are stored in separate file stores rather than in PostgreSQL. These files
are content-addressed, and the stores may be local or remote. File stores
are configured by the instance administrator at the :ref:`scope
` level.

Local storage is the default and is suitable for small installations, but
it must be backed up manually. It must be on the same machine as the
server, and so can only be used if the server is running on a single
machine.

Remote storage such as S3 is typically better for larger installations,
and is required if the server needs to be scaled across multiple machines.
An installation might rely on the provider's redundancy, or might run
separate backups using tools such as `rclone <https://rclone.org/>`__.
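
As an illustrative sketch only, a separate backup of a local file store
could be driven by rclone as below. The store path and remote name are
assumptions, not Debusine defaults, so the script assembles and prints the
command rather than executing it:

```shell
# Hypothetical example: mirror a local file store to an rclone remote.
# SRC and DEST are assumed names, not Debusine defaults.
SRC=/var/lib/debusine/store
DEST=s3backup:debusine-file-store
# --checksum suits a content-addressed store: compare hashes, not mtimes.
CMD="rclone sync --checksum $SRC $DEST"
echo "$CMD"
```

Running the printed command from a timer or cron job keeps the remote copy
in step with the local store.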
A scope may have multiple file stores with various :ref:`policies
`, which may be used to improve redundancy by storing
the same files in multiple independent places.

Each file has a corresponding row in PostgreSQL, which may be referred to
by artifacts. In the event of disaster recovery, the state of the overall
system may have changed between the most recent PostgreSQL backup and the
most recent file store backup, and it may not be possible to achieve
perfect consistency. However, this should only affect files uploaded and
database changes made since the most recent backup. In such a situation,
an administrator can minimize data loss by restoring the most recent
backup of each of PostgreSQL and the file stores, and relying on the
regular ``debusine-admin delete_expired`` and ``debusine-admin
vacuum_storage`` jobs to clean up or report on inconsistencies.

Server
======

The server can be installed using the ``debusine-server`` package. It is a
WSGI/ASGI server deployed using `Gunicorn <https://gunicorn.org/>`__ and
`Uvicorn <https://www.uvicorn.org/>`__. By default, the server runs ``(2 ×
CPU cores) + 1`` workers; this may need to be `tuned
<https://docs.gunicorn.org/en/stable/design.html#how-many-workers>`__ on
individual systems, especially those with many cores.
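
The default worker count can be reproduced with a one-line calculation;
this is just the arithmetic from the rule above, not a Debusine command:

```shell
# Compute Gunicorn's default worker count: (2 × CPU cores) + 1
CORES=$(getconf _NPROCESSORS_ONLN)
WORKERS=$(( 2 * CORES + 1 ))
echo "With $CORES cores, the default is $WORKERS workers"
```

On machines with many cores this formula can overshoot, which is why
Gunicorn's documentation recommends load-testing rather than scaling the
worker count linearly.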
The server includes configuration for running behind an `nginx
<https://nginx.org/>`__ instance on the same machine, which is
recommended. On larger installations, it is normally appropriate to also
place a load balancer such as `haproxy <https://www.haproxy.org/>`__ in
front of the server.
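
As a hedged sketch of the reverse-proxy arrangement (the socket path,
hostname, and certificate paths are assumptions; the ``debusine-server``
package ships its own nginx configuration, which should be preferred):

```nginx
# Assumed upstream: Gunicorn listening on a local Unix socket.
upstream debusine {
    server unix:/run/debusine-server/gunicorn.sock;
}

server {
    listen 443 ssl;
    server_name debusine.example.com;                 # assumed hostname
    ssl_certificate     /etc/ssl/certs/debusine.pem;  # assumed paths
    ssl_certificate_key /etc/ssl/private/debusine.key;

    location / {
        proxy_pass http://debusine;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
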
The server may in principle be deployed on multiple machines with a load
balancer to distribute traffic between them, as long as a local file store
is not in use. This is not yet well-tested in production.

Unless a local file store is in use or PostgreSQL is being run on the same
machine, the server does not store persistent data, although its
configuration should be backed up.

systemd units:

.. list-table::

   * - ``debusine-server.service``
     - Main server
   * - ``debusine-server-migrate.service``
     - Run database schema migrations on upgrade
   * - ``debusine-server-provisioner.service``
     - Dynamic worker provisioner
   * - ``debusine-server-delete-expired.timer``
     - Regular cleanup of expired or unused objects (timer)
   * - ``debusine-server-delete-expired.service``
     - Regular cleanup of expired or unused objects
   * - ``debusine-server-vacuum-storage.timer``
     - Regular storage maintenance (timer)
   * - ``debusine-server-vacuum-storage.service``
     - Regular storage maintenance

Scheduler
=========

The scheduler is responsible for distributing tasks to workers. It runs as
a Celery service, using Redis as its broker and PostgreSQL as its result
backend.

While in theory the scheduler could be deployed separately from the
server, in practice it is part of the ``debusine-server`` package and it
is normally simplest to run it on the same machine as the server. Only one
scheduler instance may run at a time; if the server is running on multiple
machines, then ``systemctl mask debusine-server-scheduler.service`` should
be run on all but one of them.

The scheduler does not store persistent data. It shares configuration with
the server, and so that configuration should be backed up.

systemd units:

.. list-table::

   * - ``debusine-server-scheduler.service``
     - Scheduler

Workers
=======

External workers
----------------

Each external worker runs on a separate machine, installed using the
``debusine-worker`` package. The various ``Recommends`` of that package
are optional in the sense that the worker can still function without them,
but it will not be able to execute all tasks, so typical worker
installations should not use APT options such as
``--no-install-recommends``.

External workers must be able to communicate with the server over HTTPS,
and must be able to download whatever resources tasks need. In typical
installations this will include broad outbound internet access.

Installations may use static workers, or may use ``debusine-admin
worker_pool`` to manage :ref:`dynamic worker pools `
on cloud providers. A "provisioner" service on the server is responsible
for auto-scaling dynamic workers; as with the scheduler, if the server is
running on multiple machines, then ``systemctl mask
debusine-server-provisioner.service`` should be run on all but one of
them. It is usually best to maintain at least a small number of static
workers to handle base load, and to reserve dynamic workers for bursts.

External workers do not store persistent data, although their
configuration should be backed up.

systemd units:

.. list-table::

   * - ``debusine-worker.service``
     - External worker

Celery workers
--------------

Celery workers handle trusted server-side tasks that require direct access
to the Debusine database. While in theory a Celery worker could be
deployed separately from the server, in practice it is part of the
``debusine-server`` package and it is normally simplest to run it on the
same machine(s) as the server.

Celery workers do not store persistent data. They share configuration with
the server, and so that configuration should be backed up.

systemd units:

.. list-table::

   * - ``debusine-server-celery.service``
     - Celery worker
   * - ``debusine-server-periodic-tasks.service``
     - Run periodic Celery tasks

Signing workers
---------------

Signing workers are responsible for handling sensitive private key
material. They require a separate PostgreSQL database. Private keys may
be stored in that database directly, encrypted at rest using a key known
to the signing workers; alternatively, there is some support for using
private keys in a hardware security module (currently only for UEFI Secure
Boot).

Signing workers must be able to communicate with the server over HTTPS,
and with their dedicated PostgreSQL database; they should not have general
internet access. The PostgreSQL database used by signing workers must be
firewalled off from everything other than the signing workers themselves.
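
As an illustrative sketch only (the address and port are assumptions about
a hypothetical network, not Debusine defaults), the firewalling
requirement could be expressed as an nftables fragment on the signing
workers' PostgreSQL host:

```text
# nftables fragment on the signing workers' PostgreSQL host.
# 192.0.2.10 stands in for the signing worker's address; note that the
# drop policy here also blocks management traffic such as SSH, so adapt
# it to your environment.
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport 5432 ip saddr 192.0.2.10 accept  # signing worker only
    }
}
```
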
If a hardware security module is in use, it is currently best to use only
a single signing worker, as the scheduler does not yet know how to route
tasks to the correct worker; we expect to remove this limitation in
future.

The contents of the PostgreSQL database used by signing workers should be
backed up in the same way as the main database, although it changes much
less often. Signing workers do not otherwise store persistent data,
although their configuration should be backed up.

systemd units:

.. list-table::

   * - ``debusine-signing.service``
     - Signing worker
   * - ``debusine-signing-migrate.service``
     - Run signing database schema migrations on upgrade

Monitoring
==========

Debusine supports `Prometheus <https://prometheus.io/>`__ monitoring using
the ``/api/1.0/open-metrics/`` endpoint. Freexian uses this to maintain
various `Grafana <https://grafana.com/>`__ graphs such as request latency,
queue lengths, and worker pool sizes, although we do not currently provide
reusable graphs.
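
A minimal Prometheus scrape job for this endpoint might look as follows;
the target hostname is an assumption, while the ``metrics_path`` is the
endpoint named above:

```yaml
scrape_configs:
  - job_name: debusine
    metrics_path: /api/1.0/open-metrics/
    scheme: https
    static_configs:
      - targets: ["debusine.example.com"]  # assumed hostname
```
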

Ansible configuration
=====================

Freexian provides several Ansible roles that may be useful to teams
deploying Debusine, either directly or as a reference.