OpenMetrics endpoint

Debusine currently provides a JSON statistics endpoint at /api/1.0/service-status/, which can be used to monitor instance performance via a custom scraper.

Prometheus allows simple automated scraping and visualization of service metrics via a text-based exposition format, which is being standardized as OpenMetrics.

Debusine should provide an OpenMetrics endpoint at /api/1.0/open-metrics/.
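
Such an endpoint can be quite small if it builds on an existing exposition library. Here is a minimal sketch of the view, assuming the prometheus_client library; the view name, the example gauge, and the hard-coded value are illustrative, not Debusine's actual code:

# A minimal sketch of the endpoint, assuming prometheus_client.
from django.http import HttpResponse
from prometheus_client import CollectorRegistry, Gauge
from prometheus_client.openmetrics.exposition import (
    CONTENT_TYPE_LATEST,
    generate_latest,
)


def open_metrics(request):
    """Render current statistics in the OpenMetrics text format."""
    registry = CollectorRegistry()
    workers = Gauge(
        "workers",
        "Number of known workers",
        ["connected", "busy", "worker_type"],
        registry=registry,
    )
    # Populated from the database in practice; hard-coded here.
    workers.labels(connected="1", busy="0", worker_type="celery").set(1)
    return HttpResponse(
        generate_latest(registry), content_type=CONTENT_TYPE_LATEST
    )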

Security

The endpoint is intended to be public, without leaking anything private that is going on in private workspaces or scopes.

It will leak the type and quantity of tasks in queues, which could be used to infer something about what is happening in private workspaces. The hope is that if enough is happening on a Debusine instance, the noise will drown out any ability to infer what is being executed from these metrics.

The endpoint is not scoped, as we are interested in system-wide statistics. It would probably work perfectly well as a scoped endpoint, so if a scoped token were supplied, it could filter to that scope. (This is not a requirement; it is out of scope.) Workers tend not to be scoped, so their metrics would have the same values across scopes.

The endpoint doesn't need to require any authentication, but if performance becomes an issue, it may be necessary to require authentication, or to cache the response for a short time (maybe 30 seconds), as sketched below.
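
If caching turns out to be necessary, Django's per-view cache would be enough. A sketch, assuming the hypothetical open_metrics view above:

# Cache the rendered response for 30 seconds using Django's
# per-view cache; open_metrics is the hypothetical view above.
from django.urls import path
from django.views.decorators.cache import cache_page

urlpatterns = [
    path("api/1.0/open-metrics/", cache_page(30)(open_metrics)),
]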

Metrics

The endpoint should include the metrics below (a sketch of how a gauge could be populated follows the list):

  • artifacts: Gauge of the number of Artifacts. Labels:

    • category: From Artifact.category.

    • scope: From Artifact.workspace.scope.name.

  • assets: Gauge of Assets. Labels:

    • category: From Asset.category.

    • scope: From Asset.workspace.scope.name.

  • collections: Gauge of collections. Labels:

    • category: From Collection.category.

    • scope: From Collection.workspace.scope.name.

  • file_store_size: Gauge of the total_size of File stores. Labels:

    • backend: From FileStore.backend.

    • name: From FileStore.name.

  • file_store_max_size: Gauge of the max_size of File stores. Labels:

    • backend: From FileStore.backend.

    • name: From FileStore.name.

  • groups: Gauge of the number of Groups. Labels:

    • ephemeral: 1 or 0, from Group.ephemeral.

  • tokens: Gauge of Tokens. Labels:

    • enabled: 1 or 0, from Token.enabled.

    • user: 1 if Token.user is not null, otherwise 0.

    • worker: 1 if Token.worker is not null, otherwise 0.

  • users: Gauge of the number of Users. Labels:

    • active: 1 or 0, from User.active.

  • user_activity: Histogram of the number of Users, bucketed by the number of days since they last created a workflow. Buckets:

    • <= 1

    • <= 3

    • <= 7

    • <= 14

    • <= 30

    • <= 90

    • <= 365

    • <= ∞

    Labels:

    • scope: The scope that the user created a workflow in.

  • user_identities: Gauge of the number of User SSO connections. Labels:

    • active: 1 or 0, from User.active.

    • issuer: From Identity.issuer.

  • user_identities_activity: Histogram of the number of Users, bucketed by the number of days since they last created a workflow. Buckets:

    • <= 1

    • <= 3

    • <= 7

    • <= 14

    • <= 30

    • <= 90

    • <= 365

    • <= ∞

    Labels:

    • issuer: From Identity.issuer.

    • scope: The scope that the user created a workflow in.

  • work_requests: Gauge of Work Requests. Labels:

    • task_type: From WorkRequest.task_type.

    • task_name: From WorkRequest.task_name.

    • scope: From WorkRequest.workspace.scope.name.

    • status: From WorkRequest.status.

    • host_architecture: From WorkRequest.data.host_architecture (only for worker tasks).

    • backend: From WorkRequest.data.backend (only for worker tasks).

  • workers: Gauge of workers. Labels:

    • connected: 1 if Worker.connected_at is not NULL, otherwise 0.

    • busy: 1 if Worker.is_busy, otherwise 0.

    • worker_type: From Worker.worker_type.

    • worker_pool: From Worker.worker_pool (only for pool workers).

    • architecture_$ARCH: 1 if $ARCH is in Worker.metadata.system:architectures, otherwise 0 (only for external workers).

    • host_architecture: from Worker.metadata.system:host_architecture (only for external workers).

  • worker_pool_runtime: Gauge of total runtime of Dynamic Worker Pools. Labels:

    • worker_pool: From Worker.worker_pool (only for pool workers).

  • workflow_templates: Gauge of the number of WorkflowTemplates. Labels:

    • task_name: From WorkflowTemplate.task_name.

    • scope: From WorkflowTemplate.workspace.scope.name.

  • workspaces: Gauge of workspaces. Labels:

    • private: 1 if Workspace.public is False, otherwise 0.

    • expires: 1 if Workspace.expiration_delay is set, otherwise 0.
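
Populating one of these gauges is straightforward with the ORM. A sketch for the workspaces gauge, assuming Debusine's Django models (field names taken from the list above) and the registry from the earlier view sketch; a production version would likely fold this into a single aggregated query:

# A sketch of populating the workspaces gauge; the import path is
# assumed, and `registry` comes from the earlier view sketch.
from prometheus_client import Gauge

from debusine.db.models import Workspace

workspaces = Gauge(
    "workspaces",
    "Number of workspaces",
    ["private", "expires"],
    registry=registry,
)
for public in (True, False):
    for expires in (True, False):
        count = Workspace.objects.filter(
            public=public, expiration_delay__isnull=not expires
        ).count()
        workspaces.labels(
            private="0" if public else "1",
            expires="1" if expires else "0",
        ).set(count)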

Example

The work request and worker metrics are a little more complex, so here’s an example of how they’d render:

# TYPE work_requests gauge
# HELP work_requests Number of known Work Requests
work_requests{task_type="Worker",task_name="sbuild",status="running",host_architecture="amd64",backend="auto",scope="debian"} 1
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="auto",scope="debian"} 3
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="auto",scope="another"} 2
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="unshare",scope="debian"} 2
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="i386",backend="auto",scope="debian"} 1
work_requests{task_type="Workflow",task_name="sbuild",status="running",scope="debian"} 1

# TYPE workers gauge
# HELP workers Number of known workers
workers{connected="1",busy="1",worker_type="external",architecture_amd64="1",architecture_i386="1",host_architecture="amd64"} 1
workers{connected="0",busy="0",worker_type="external",architecture_amd64="1",architecture_i386="1",host_architecture="amd64"} 5
workers{connected="1",busy="0",worker_type="external",architecture_arm64="1",architecture_armhf="1",architecture_armel="1",host_architecture="arm64"} 5
workers{connected="1",busy="0",worker_type="celery"} 1

Future work

There are some more metrics we could include, but they are expensive to calculate without triggers and a summary table, or without calculating a baseline on startup and then maintaining running totals in e.g. Redis (sketched after the example query below).

  • artifact_file_size: Gauge of the total size of files in artifacts in bytes. Labels:

    • category: From Artifact.category.

    • scope: From Artifact.workspace.scope.name.

Computing this directly is already noticeably slow:

SELECT
  SUM(size),
  category,
  db_scope.name AS scope
FROM db_file
INNER JOIN db_fileinartifact ON (db_file.id = db_fileinartifact.file_id)
INNER JOIN db_artifact ON (db_fileinartifact.artifact_id = db_artifact.id)
INNER JOIN db_workspace ON (db_workspace.id = db_artifact.workspace_id)
INNER JOIN db_scope ON (db_scope.id = db_workspace.scope_id)
GROUP BY
  category,
  scope
;
...

(12 rows)
Time: 1632.224 ms (00:01.632)
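
To make such a metric cheap to serve, the totals could be maintained incrementally instead. A minimal sketch assuming redis-py, where on_file_added_to_artifact is a hypothetical hook (not an existing Debusine signal) called whenever a file is added to an artifact:

# A sketch of maintaining artifact_file_size as running totals in
# Redis; on_file_added_to_artifact is a hypothetical hook.
import redis

r = redis.Redis()


def on_file_added_to_artifact(file, artifact):
    # One hash field per (category, scope) label combination; the
    # endpoint would read them all back with r.hgetall().
    scope = artifact.workspace.scope.name
    r.hincrby(
        "artifact_file_size", f"{artifact.category}:{scope}", file.size
    )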

Histograms

There are a few more metrics that could make useful histograms. For now we leave these as potential future improvements; a sketch of how one could be exposed follows the list.

  • task_execution_times: extracted from our RuntimeStatistics model.

  • artifact_file_counts: number of files per artifact.

  • artifact_file_sizes: file sizes for artifact files.

  • collection_sizes: number of items in collections.

  • tokens_last_seen: now - last_seen_at for enabled tokens.
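
Since these histograms would be computed in the database rather than observed one sample at a time, they could be exposed through a custom collector. A sketch for collection_sizes, assuming prometheus_client; the bucket bounds are illustrative, and collection_size_buckets() is a hypothetical query helper returning, per category, cumulative bucket counts and the total number of items:

# A sketch of exposing collection_sizes as a pre-bucketed histogram;
# collection_size_buckets() is a hypothetical query helper.
from prometheus_client.core import HistogramMetricFamily

BUCKETS = ("10", "100", "1000", "10000", "+Inf")


class CollectionSizesCollector:
    def collect(self):
        histogram = HistogramMetricFamily(
            "collection_sizes",
            "Collections by number of items",
            labels=["category"],
        )
        for category, counts, total_items in collection_size_buckets():
            # counts is cumulative: one value per bucket in BUCKETS,
            # each counting collections at or below that upper bound.
            histogram.add_metric(
                [category], list(zip(BUCKETS, counts)), sum_value=total_items
            )
        yield histogram

A collector like this would be registered with the endpoint's CollectorRegistry, so the underlying query runs once per scrape.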