====================
OpenMetrics endpoint
====================

Currently Debusine provides a JSON statistics endpoint at
``/api/1.0/service-status/``. This can be used to monitor instance
performance, via a custom scraper.

Prometheus allows automated scraping and visualization of service metrics,
via a simple text format, which is being standardized as OpenMetrics.
Debusine should provide an OpenMetrics endpoint at
``/api/1.0/open-metrics/``.

Security
--------

The goal of the endpoint is to be public, and not leak anything private
going on in private workspaces or scopes. It will leak the type and
quantity of tasks in queues, which could be used to determine something
about what's happening in private workspaces. The hope is that if there's
enough happening on a Debusine instance, the noise would drown out any
ability to infer what is being executed from these metrics.

The endpoint is not scoped, as we're interested in system-wide statistics.
It would probably work perfectly well as a scoped endpoint, so if a scoped
token is supplied, it could filter to that scope. (This is not a
requirement; it is out of scope.) Workers tend not to be scoped, so they'd
have the same values across scopes.

The endpoint doesn't need to require any authentication, but if performance
is an issue, it may be necessary to require auth, or to cache the response
for a short amount of time (maybe 30 seconds).
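As a rough illustration (not a prescribed implementation), such an endpoint
could be a plain Django view backed by the ``prometheus_client`` library,
with a short cache to address the performance concern above. The
``_collect_metrics()`` helper and the view name below are hypothetical;
``prometheus_client`` also ships an OpenMetrics exposition module that
could be used in place of the classic text format.

.. code-block:: python

   # Sketch only: a public, unauthenticated view serving the metrics
   # described in the next section. _collect_metrics() is a hypothetical
   # helper, not existing Debusine code.
   from django.http import HttpRequest, HttpResponse
   from django.views.decorators.cache import cache_page
   from prometheus_client import (
       CONTENT_TYPE_LATEST,
       CollectorRegistry,
       generate_latest,
   )


   def _collect_metrics(registry: CollectorRegistry) -> None:
       """Populate one metric family per entry in the list below."""
       ...  # e.g. collect_work_requests(registry), collect_workers(registry)


   @cache_page(30)  # cache briefly if rendering the metrics proves expensive
   def open_metrics(request: HttpRequest) -> HttpResponse:
       registry = CollectorRegistry()
       _collect_metrics(registry)
       return HttpResponse(
           generate_latest(registry), content_type=CONTENT_TYPE_LATEST
       )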
Metrics
-------

The endpoint should include these metrics:

* ``artifacts``: Gauge of the number of :ref:`artifact-reference`. Labels:

  * ``category``: From ``Artifact.category``.
  * ``scope``: From ``Artifact.workspace.scope.name``.

* ``assets``: Gauge of :ref:`assets`. Labels:

  * ``category``: From ``Asset.category``.
  * ``scope``: From ``Asset.workspace.scope.name``.

* ``collections``: Gauge of :ref:`collections`. Labels:

  * ``category``: From ``Collection.category``.
  * ``scope``: From ``Collection.workspace.scope.name``.

* ``file_store_size``: Gauge of the ``total_size`` of
  :ref:`file-store-reference`. Labels:

  * ``backend``: From ``FileStore.backend``.
  * ``name``: From ``FileStore.name``.

* ``file_store_max_size``: Gauge of the ``max_size`` of
  :ref:`file-store-reference`. Labels:

  * ``backend``: From ``FileStore.backend``.
  * ``name``: From ``FileStore.name``.

* ``groups``: Gauge of the number of Groups. Labels:

  * ``ephemeral``: ``1`` or ``0``, from ``Group.ephemeral``.

* ``tokens``: Gauge of Tokens. Labels:

  * ``enabled``: ``1`` or ``0``, from ``Token.enabled``.
  * ``user``: ``1`` if ``Token.user`` is not null, otherwise ``0``.
  * ``worker``: ``1`` if ``Token.worker`` is not null, otherwise ``0``.

* ``users``: Gauge of the number of Users. Labels:

  * ``active``: ``1`` or ``0``, from ``User.active``.

* ``user_activity``: Histogram of the number of Users, bucketed by the
  number of days since they last created a workflow. Buckets:

  * <= 1
  * <= 3
  * <= 7
  * <= 14
  * <= 30
  * <= 90
  * <= 365
  * <= ∞

  Labels:

  * ``scope``: The scope that the user created a workflow in.

* ``user_identities``: Gauge of the number of User SSO connections. Labels:

  * ``active``: ``1`` or ``0``, from ``User.active``.
  * ``issuer``: From ``Identity.issuer``.

* ``user_identities_activity``: Histogram of the number of Users, bucketed
  by the number of days since they last created a workflow. Buckets:

  * <= 1
  * <= 3
  * <= 7
  * <= 14
  * <= 30
  * <= 90
  * <= 365
  * <= ∞

  Labels:

  * ``issuer``: From ``Identity.issuer``.
  * ``scope``: The scope that the user created a workflow in.

* ``work_requests``: Gauge of :ref:`work-requests`. Labels:

  * ``task_type``: From ``WorkRequest.task_type``.
  * ``task_name``: From ``WorkRequest.task_name``.
  * ``scope``: From ``WorkRequest.workspace.scope.name``.
  * ``status``: From ``WorkRequest.status``.
  * ``host_architecture``: From ``WorkRequest.data.host_architecture``
    (only for worker tasks).
  * ``backend``: From ``WorkRequest.data.backend`` (only for worker tasks).

* ``workers``: Gauge of workers. Labels:

  * ``connected``: ``1`` if ``Worker.connected_at`` is not NULL, otherwise
    ``0``.
  * ``busy``: ``1`` if ``Worker.is_busy``, otherwise ``0``.
  * ``worker_type``: From ``Worker.worker_type``.
  * ``worker_pool``: From ``Worker.worker_pool`` (only for pool workers).
  * ``architecture_$ARCH``: ``1`` if ``$ARCH`` is in
    ``Worker.metadata.system:architectures``, otherwise ``0`` (only for
    external workers).
  * ``host_architecture``: From ``Worker.metadata.system:host_architecture``
    (only for external workers).

* ``worker_pool_runtime``: Gauge of total runtime of
  :ref:`dynamic-worker-pools`. Labels:

  * ``worker_pool``: From ``Worker.worker_pool`` (only for pool workers).

* ``workflow_templates``: Gauge of the number of :ref:`workflow-template`.
  Labels:

  * ``task_name``: From ``WorkflowTemplate.task_name``.
  * ``scope``: From ``WorkflowTemplate.workspace.scope.name``.

* ``workspaces``: Gauge of workspaces. Labels:

  * ``private``: ``1`` or ``0``, derived from ``Workspace.public``.
  * ``expires``: ``1`` if ``Workspace.expiration_delay`` is set, otherwise
    ``0``.

Example
-------

The work request and worker metrics are a little more complex, so here's an
example of how they'd render:

.. code-block::

   # TYPE work_requests gauge
   # HELP work_requests Number of known Work Requests
   work_requests{task_type="Worker", task_name="sbuild", status="running", host_architecture="amd64", backend="auto", scope="debian"} 1
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="auto", scope="debian"} 3
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="auto", scope="another"} 2
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="amd64", backend="unshare", scope="debian"} 2
   work_requests{task_type="Worker", task_name="sbuild", status="pending", host_architecture="i386", backend="auto", scope="debian"} 1
   work_requests{task_type="Workflow", task_name="sbuild", status="running", scope="debian"} 1

   # TYPE workers gauge
   # HELP workers Number of known workers
   workers{connected="1", busy="1", worker_type="external", architecture_amd64="1", architecture_i386="1", host_architecture="amd64"} 1
   workers{connected="0", busy="0", worker_type="external", architecture_amd64="1", architecture_i386="1", host_architecture="amd64"} 5
   workers{connected="1", busy="0", worker_type="external", architecture_arm64="1", architecture_armhf="1", architecture_armel="1", host_architecture="arm64"} 5
   workers{connected="1", busy="0", worker_type="celery"} 1
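To make the aggregation concrete, here is a sketch of how the
``work_requests`` gauge could be filled with a single grouped query. The
import path, the helper name and the omission of the ``data``-derived
labels (``host_architecture``, ``backend``) are simplifications for
illustration, not Debusine's actual implementation.

.. code-block:: python

   # Sketch: aggregate WorkRequest rows into the work_requests gauge.
   from django.db.models import Count
   from prometheus_client import CollectorRegistry, Gauge

   from debusine.db.models import WorkRequest  # import path assumed


   def collect_work_requests(registry: CollectorRegistry) -> None:
       gauge = Gauge(
           "work_requests",
           "Number of known Work Requests",
           ["task_type", "task_name", "status", "scope"],
           registry=registry,
       )
       # One row per distinct label combination, counted in the database.
       rows = WorkRequest.objects.values(
           "task_type", "task_name", "status", "workspace__scope__name"
       ).annotate(count=Count("id"))
       for row in rows:
           gauge.labels(
               task_type=row["task_type"],
               task_name=row["task_name"],
               status=row["status"],
               scope=row["workspace__scope__name"],
           ).set(row["count"])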
Future work
-----------

There are some more metrics we could include, but they are expensive to
calculate without either adding triggers and a summary table, or
calculating a number on startup and then maintaining running totals in
e.g. Redis.

* ``artifact_file_size``: Gauge of the total size of files in artifacts in
  bytes. Labels:

  * ``category``: From ``Artifact.category``.
  * ``scope``: From ``Artifact.workspace.scope.name``.

  .. code-block:: sql

     SELECT SUM(size), category, db_scope.name AS scope
       FROM db_file
       INNER JOIN db_fileinartifact ON (db_file.id = db_fileinartifact.file_id)
       INNER JOIN db_artifact ON (db_fileinartifact.artifact_id = db_artifact.id)
       INNER JOIN db_workspace ON (db_workspace.id = db_artifact.workspace_id)
       INNER JOIN db_scope ON (db_scope.id = db_workspace.scope_id)
       GROUP BY category, scope;

     ...
     (12 rows)

     Time: 1632.224 ms (00:01.632)

Histograms
----------

There are a few more metrics that could make useful histograms. For now we
leave these as potential future improvements.

* ``task_execution_times``: extracted from our :ref:`runtime-statistics`.
* ``artifact_file_counts``: number of files per artifact.
* ``artifact_file_sizes``: file sizes for artifact files.
* ``collection_sizes``: number of items in collections.
* ``tokens_last_seen``: ``now - last_seen_at`` for enabled tokens.
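The ``user_activity`` and ``user_identities_activity`` metrics above (and
likely these future ones too) are snapshot histograms with fixed buckets
computed from the database, rather than observation-driven ones. A minimal,
library-independent sketch of that cumulative bucketing, with illustrative
names only:

.. code-block:: python

   # Sketch: turn per-user "days since last workflow" values into the
   # cumulative (le) bucket counts a Prometheus histogram expects.
   import math

   USER_ACTIVITY_BUCKETS = [1, 3, 7, 14, 30, 90, 365, math.inf]


   def bucket_user_activity(
       days_since_last_workflow: list[float],
   ) -> dict[float, int]:
       """Return cumulative counts per bucket upper bound."""
       counts = {bound: 0 for bound in USER_ACTIVITY_BUCKETS}
       for days in days_since_last_workflow:
           for bound in USER_ACTIVITY_BUCKETS:
               if days <= bound:
                   counts[bound] += 1  # buckets are cumulative
       return counts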