====================
OpenMetrics endpoint
====================

Currently Debusine provides a JSON statistics endpoint at
``/api/1.0/service-status/``. This can be used to monitor instance
performance, via a custom scraper.

Prometheus allows automated scraping and visualization of service metrics,
via a simple text format, which is being standardized as OpenMetrics.
Debusine should provide an OpenMetrics endpoint at
``/api/1.0/open-metrics/``.

Security
--------

The goal of the endpoint is to be public, and not leak anything private
going on in private workspaces or scopes. It will leak the type and
quantity of tasks in queues, which could be used to determine something
about what's happening in private workspaces. The hope is that if there's
enough happening on a Debusine instance, the noise would drown out any
ability to infer what is being executed from these metrics.

The endpoint is not scoped, as we're interested in system-wide statistics.
It would probably work perfectly well as a scoped endpoint, so if a scoped
token is supplied, it could filter to that scope. (This is not a
requirement; it is out of scope.) Workers tend not to be scoped, so they'd
have the same values across scopes.

The endpoint doesn't need to require any authentication, but if performance
is an issue, it may be necessary to require auth, or to cache the response
for a short amount of time (maybe 30 seconds).
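As a rough illustration (not a prescribed implementation), such an endpoint
could be a plain Django view backed by the ``prometheus_client`` library,
with a short cache to address the performance concern above. The
``_collect_metrics()`` helper and the view name below are hypothetical;
``prometheus_client`` also ships an OpenMetrics exposition module that
could be used in place of the classic text format.

.. code-block:: python

   # Sketch only: a public, unauthenticated view serving the metrics
   # described in the next section. _collect_metrics() is a hypothetical
   # helper, not existing Debusine code.
   from django.http import HttpRequest, HttpResponse
   from django.views.decorators.cache import cache_page
   from prometheus_client import (
       CONTENT_TYPE_LATEST,
       CollectorRegistry,
       generate_latest,
   )


   def _collect_metrics(registry: CollectorRegistry) -> None:
       """Populate one metric family per entry in the list below."""
       ...  # e.g. collect_work_requests(registry), collect_workers(registry)


   @cache_page(30)  # cache briefly if rendering the metrics proves expensive
   def open_metrics(request: HttpRequest) -> HttpResponse:
       registry = CollectorRegistry()
       _collect_metrics(registry)
       return HttpResponse(
           generate_latest(registry), content_type=CONTENT_TYPE_LATEST
       )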
Metrics
-------

The endpoint should include these metrics:

* ``artifacts``: Gauge of the number of :ref:`artifact-reference`. Labels:

  * ``category``: From ``Artifact.category``.
  * ``scope``: From ``Artifact.workspace.scope.name``.

* ``assets``: Gauge of :ref:`assets`. Labels:

  * ``category``: From ``Asset.category``.
  * ``scope``: From ``Asset.workspace.scope.name``.

* ``collections``: Gauge of :ref:`collections`. Labels:

  * ``category``: From ``Collection.category``.
  * ``scope``: From ``Collection.workspace.scope.name``.

* ``file_store_size``: Gauge of the ``total_size`` of
  :ref:`file-store-reference`. Labels:

  * ``backend``: From ``FileStore.backend``.
  * ``name``: From ``FileStore.name``.

* ``file_store_max_size``: Gauge of the ``max_size`` of
  :ref:`file-store-reference`. Labels:

  * ``backend``: From ``FileStore.backend``.
  * ``name``: From ``FileStore.name``.

* ``groups``: Gauge of the number of Groups. Labels:

  * ``ephemeral``: ``1`` or ``0``, from ``Group.ephemeral``.

* ``tokens``: Gauge of Tokens. Labels:

  * ``enabled``: ``1`` or ``0``, from ``Token.enabled``.
  * ``user``: ``1`` if ``Token.user`` is not null, otherwise ``0``.
  * ``worker``: ``1`` if ``Token.worker`` is not null, otherwise ``0``.

* ``users``: Gauge of the number of Users. Labels:

  * ``active``: ``1`` or ``0``, from ``User.active``.

* ``user_activity``: Histogram of the number of Users, bucketed by the
  number of days since they last created a workflow. Buckets:

  * <= 1
  * <= 3
  * <= 7
  * <= 14
  * <= 30
  * <= 90
  * <= 365
  * <= ∞

  Labels:

  * ``scope``: The scope that the user created a workflow in.

* ``user_identities``: Gauge of the number of User SSO connections. Labels:

  * ``active``: ``1`` or ``0``, from ``User.active``.
  * ``issuer``: From ``Identity.issuer``.

* ``user_identities_activity``: Histogram of the number of Users, bucketed
  by the number of days since they last created a workflow. Buckets:

  * <= 1
  * <= 3
  * <= 7
  * <= 14
  * <= 30
  * <= 90
  * <= 365
  * <= ∞

  Labels:

  * ``issuer``: From ``Identity.issuer``.
  * ``scope``: The scope that the user created a workflow in.

* ``work_requests``: Gauge of :ref:`work-requests`. Labels:

  * ``task_type``: From ``WorkRequest.task_type``.
  * ``task_name``: From ``WorkRequest.task_name``.
  * ``scope``: From ``WorkRequest.workspace.scope.name``.
  * ``status``: From ``WorkRequest.status``.
  * ``host_architecture``: From ``WorkRequest.data.host_architecture``
    (only for worker tasks).
  * ``backend``: From ``WorkRequest.data.backend`` (only for worker tasks).

* ``workers``: Gauge of workers. Labels:

  * ``connected``: ``1`` if ``Worker.connected_at`` is not NULL, otherwise
    ``0``.
  * ``busy``: ``1`` if ``Worker.is_busy``, otherwise ``0``.
  * ``worker_type``: From ``Worker.worker_type``.
  * ``worker_pool``: From ``Worker.worker_pool`` (only for pool workers).
  * ``architecture_$ARCH``: ``1`` if ``$ARCH`` is in
    ``Worker.metadata.system:architectures``, otherwise ``0`` (only for
    external workers).
  * ``host_architecture``: From ``Worker.metadata.system:host_architecture``
    (only for external workers).

* ``worker_pool_runtime``: Gauge of total runtime of
  :ref:`dynamic-worker-pools`. Labels:

  * ``worker_pool``: From ``Worker.worker_pool`` (only for pool workers).

* ``workflow_templates``: Gauge of the number of :ref:`workflow-template`.
  Labels:

  * ``task_name``: From ``WorkflowTemplate.task_name``.
  * ``scope``: From ``WorkflowTemplate.workspace.scope.name``.

* ``workspaces``: Gauge of workspaces. Labels:

  * ``private``: ``1`` or ``0``, derived from ``Workspace.public``.
  * ``expires``: ``1`` if ``Workspace.expiration_delay`` is set, otherwise
    ``0``.

Example
-------

The work request and worker metrics are a little more complex, so here's an
example of how they'd render:

.. code-block::

   # TYPE work_requests gauge
   # HELP work_requests Number of known Work Requests
   work_requests{task_type="Worker", task_name="sbuild", status="running", host_architecture="amd64", backend="auto", scope="debian"} 1
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="auto", scope="debian"} 3
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="auto", scope="another"} 2
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="unshare", scope="debian"} 2
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="i386", backend="auto", scope="debian"} 1
   work_requests{task_type="Workflow", task_name="sbuild", status="running", scope="debian"} 1

   # TYPE workers gauge
   # HELP workers Number of known workers
   workers{connected="1", busy="1", worker_type="external", architecture_amd64="1", architecture_i386="1", host_architecture="amd64"} 1
   workers{connected="0", busy="0", worker_type="external", architecture_amd64="1", architecture_i386="1", host_architecture="amd64"} 5
   workers{connected="1", busy="0", worker_type="external", architecture_arm64="1", architecture_armhf="1", architecture_armel="1", host_architecture="arm64"} 5
   workers{connected="1", busy="0", worker_type="celery"} 1
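To make the aggregation concrete, here is a sketch of how the
``work_requests`` gauge could be filled with a single grouped query. The
import path, the helper name and the omission of the ``data``-derived
labels (``host_architecture``, ``backend``) are simplifications for
illustration, not Debusine's actual implementation.

.. code-block:: python

   # Sketch: aggregate WorkRequest rows into the work_requests gauge.
   from django.db.models import Count
   from prometheus_client import CollectorRegistry, Gauge

   from debusine.db.models import WorkRequest  # import path assumed


   def collect_work_requests(registry: CollectorRegistry) -> None:
       gauge = Gauge(
           "work_requests",
           "Number of known Work Requests",
           ["task_type", "task_name", "status", "scope"],
           registry=registry,
       )
       # One row per distinct label combination, counted in the database.
       rows = WorkRequest.objects.values(
           "task_type", "task_name", "status", "workspace__scope__name"
       ).annotate(count=Count("id"))
       for row in rows:
           gauge.labels(
               task_type=row["task_type"],
               task_name=row["task_name"],
               status=row["status"],
               scope=row["workspace__scope__name"],
           ).set(row["count"])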
Future work
-----------

There are some more metrics we could include, but they are expensive to
calculate without either adding triggers and a summary table, or
calculating a number on startup and then maintaining running totals in
e.g. Redis.

* ``artifact_file_size``: Gauge of the total size of files in artifacts in
  bytes. Labels:

  * ``category``: From ``Artifact.category``.
  * ``scope``: From ``Artifact.workspace.scope.name``.

  .. code-block:: sql

     SELECT SUM(size), category, db_scope.name AS scope
       FROM db_file
       INNER JOIN db_fileinartifact ON (db_file.id = db_fileinartifact.file_id)
       INNER JOIN db_artifact ON (db_fileinartifact.artifact_id = db_artifact.id)
       INNER JOIN db_workspace ON (db_workspace.id = db_artifact.workspace_id)
       INNER JOIN db_scope ON (db_scope.id = db_workspace.scope_id)
       GROUP BY category, scope;

     ...
     (12 rows)

     Time: 1632.224 ms (00:01.632)

Histograms
----------

There are a few more metrics that could make useful histograms. For now we
leave these as potential future improvements.

* ``task_execution_times``: extracted from our :ref:`runtime-statistics`.
* ``artifact_file_counts``: number of files per artifact.
* ``artifact_file_sizes``: file sizes for artifact files.
* ``collection_sizes``: number of items in collections.
* ``tokens_last_seen``: ``now - last_seen_at`` for enabled tokens.
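The ``user_activity`` and ``user_identities_activity`` metrics above (and
likely these future ones too) are snapshot histograms with fixed buckets
computed from the database, rather than observation-driven ones. A minimal,
library-independent sketch of that cumulative bucketing, with illustrative
names only:

.. code-block:: python

   # Sketch: turn per-user "days since last workflow" values into the
   # cumulative (le) bucket counts a Prometheus histogram expects.
   import math

   USER_ACTIVITY_BUCKETS = [1, 3, 7, 14, 30, 90, 365, math.inf]


   def bucket_user_activity(
       days_since_last_workflow: list[float],
   ) -> dict[float, int]:
       """Return cumulative counts per bucket upper bound."""
       counts = {bound: 0 for bound in USER_ACTIVITY_BUCKETS}
       for days in days_since_last_workflow:
           for bound in USER_ACTIVITY_BUCKETS:
               if days <= bound:
                   counts[bound] += 1  # buckets are cumulative
       return counts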