OpenMetrics endpoint
Currently, Debusine provides a JSON statistics endpoint at /api/1.0/service-status/. This can be used to monitor instance performance via a custom scraper.
Prometheus allows automated scraping and visualization of service metrics via a simple text format, which is being standardized as OpenMetrics.
Debusine should provide an OpenMetrics endpoint at /api/1.0/open-metrics/.
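As a rough sketch of what the endpoint would emit (not Debusine's actual implementation), a gauge family can be rendered in the Prometheus/OpenMetrics text format with a few lines of Python; the metric and label values below are illustrative only:

```python
def render_gauge(name, help_text, samples):
    """Render one gauge family in the Prometheus/OpenMetrics text format.

    samples maps a tuple of (label_name, label_value) pairs to a value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value:g}")
    return "\n".join(lines) + "\n"

# Illustrative data, not real Debusine values:
text = render_gauge(
    "workers",
    "Number of known workers",
    {(("connected", "1"), ("worker_type", "celery")): 1},
)
```

In practice a library such as prometheus_client would handle escaping and formatting, but the format is simple enough that the sketch above covers the common case.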
Security
The endpoint is intended to be public, without leaking anything private happening in private workspaces or scopes.
It will leak the type and quantity of tasks in queues, which could be used to determine something about what’s happening in private workspaces. The hope would be that if there’s enough happening on a debusine instance, the noise would drown out any ability to infer what is being executed from these metrics.
The endpoint is not scoped, as we’re interested in system-wide statistics. It would probably work perfectly well as a scoped endpoint, so if a scoped token is supplied, it could filter to that scope; this is not a requirement, and is out of scope here. Workers tend not to be scoped, so they’d have the same values across scopes.
The endpoint doesn’t need to require any auth, but if performance is an issue, it may be necessary to require auth, or to cache the response for a short amount of time (maybe 30 seconds).
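If caching does turn out to be necessary, a minimal TTL cache around the (potentially expensive) rendering step could look like the following sketch; the 30-second window and the `render` callable are assumptions for illustration, not Debusine API:

```python
import time


class TTLCache:
    """Cache a single rendered payload for a fixed number of seconds."""

    def __init__(self, render, ttl=30.0, clock=time.monotonic):
        self._render = render  # callable producing the metrics text
        self._ttl = ttl
        self._clock = clock  # injectable for testing
        self._value = None
        self._expires_at = float("-inf")

    def get(self):
        now = self._clock()
        if now >= self._expires_at:
            # Cache miss or expired: re-render and reset the window.
            self._value = self._render()
            self._expires_at = now + self._ttl
        return self._value
```

With a fake clock, two calls within 30 seconds hit the renderer only once; in a real deployment Django's cache framework would be a more natural fit.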
Metrics
The endpoint should include these metrics:
artifacts
: Gauge of the number of Artifacts. Labels: category (from Artifact.category), scope (from Artifact.workspace.scope.name).
assets
: Gauge of Assets. Labels: category (from Asset.category), scope (from Asset.workspace.scope.name).
collections
: Gauge of collections. Labels: category (from Collection.category), scope (from Collection.workspace.scope.name).
file_store_size
: Gauge of the total_size of File stores. Labels: backend (from FileStore.backend), name (from FileStore.name).
file_store_max_size
: Gauge of the max_size of File stores. Labels: backend (from FileStore.backend), name (from FileStore.name).
groups
: Gauge of the number of Groups. Labels: ephemeral (1 or 0, from Group.ephemeral).
tokens
: Gauge of Tokens. Labels: enabled (1 or 0, from Token.enabled), user (1 if Token.user is not null, otherwise 0), worker (1 if Token.worker is not null, otherwise 0).
users
: Gauge of the number of Users. Labels: active (1 or 0, from User.active).
user_activity
: Histogram of the number of Users, bucketed by the number of days since they last created a workflow. Buckets: <= 1, <= 3, <= 7, <= 14, <= 30, <= 90, <= 365, <= ∞. Labels: scope (the scope that the user created a workflow in).
user_identities
: Gauge of the number of User SSO connections. Labels: active (1 or 0, from User.active), issuer (from Identity.issuer).
user_identities_activity
: Histogram of the number of Users, bucketed by the number of days since they last created a workflow. Buckets: <= 1, <= 3, <= 7, <= 14, <= 30, <= 90, <= 365, <= ∞. Labels: issuer (from Identity.issuer), scope (the scope that the user created a workflow in).
work_requests
: Gauge of Work Requests. Labels: task_type (from WorkRequest.task_type), task_name (from WorkRequest.task_name), scope (from WorkRequest.workspace.scope.name), status (from WorkRequest.status), host_architecture (from WorkRequest.data.host_architecture, only for worker tasks), backend (from WorkRequest.data.backend, only for worker tasks).
workers
: Gauge of workers. Labels: connected (1 if Worker.connected_at is not NULL, otherwise 0), busy (1 if Worker.is_busy, otherwise 0), worker_type (from Worker.worker_type), worker_pool (from Worker.worker_pool, only for pool workers), architecture_$ARCH (1 if $ARCH is in Worker.metadata.system:architectures, otherwise 0, only for external workers), host_architecture (from Worker.metadata.system:host_architecture, only for external workers).
worker_pool_runtime
: Gauge of the total runtime of Dynamic Worker Pools. Labels: worker_pool (from Worker.worker_pool, only for pool workers).
workflow_templates
: Gauge of the number of WorkflowTemplates. Labels: task_name (from WorkflowTemplate.task_name), scope (from WorkflowTemplate.workspace.scope.name).
workspaces
: Gauge of workspaces. Labels: private (1 if Workspace.public is false, otherwise 0), expires (1 if Workspace.expiration_delay is set, otherwise 0).
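Most of the gauges above are counts grouped by a tuple of label values. As a rough sketch of that aggregation step (using hypothetical in-memory rows; in Debusine this would be a grouped database query, not Python-side counting):

```python
from collections import Counter

# Hypothetical work request rows for illustration only.
work_requests = [
    {"task_type": "Worker", "task_name": "sbuild", "status": "pending", "scope": "debian"},
    {"task_type": "Worker", "task_name": "sbuild", "status": "pending", "scope": "debian"},
    {"task_type": "Worker", "task_name": "sbuild", "status": "running", "scope": "debian"},
]


def gauge_samples(rows, label_names):
    """Count rows per combination of label values."""
    return Counter(tuple(row[name] for name in label_names) for row in rows)


samples = gauge_samples(work_requests, ("task_type", "task_name", "status", "scope"))
```

Each key of the resulting Counter maps directly to one labelled sample line in the exposition format.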
Example
The work request and worker metrics are a little more complex, so here’s an example of how they’d render:
# TYPE work_requests gauge
# HELP work_requests Number of known Work Requests
work_requests{task_type="Worker",task_name="sbuild",status="running",host_architecture="amd64",backend="auto",scope="debian"} 1
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="auto",scope="debian"} 3
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="auto",scope="another"} 2
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="amd64",backend="unshare",scope="debian"} 2
work_requests{task_type="Worker",task_name="sbuild",status="pending",host_architecture="i386",backend="auto",scope="debian"} 1
work_requests{task_type="Workflow",task_name="sbuild",status="running",scope="debian"} 1

# TYPE workers gauge
# HELP workers Number of known workers
workers{connected="1",busy="1",worker_type="external",architecture_amd64="1",architecture_i386="1",host_architecture="amd64"} 1
workers{connected="0",busy="0",worker_type="external",architecture_amd64="1",architecture_i386="1",host_architecture="amd64"} 5
workers{connected="1",busy="0",worker_type="external",architecture_arm64="1",architecture_armhf="1",architecture_armel="1",host_architecture="arm64"} 5
workers{connected="1",busy="0",worker_type="celery"} 1
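The user_activity and user_identities_activity metrics are histograms rather than gauges; in the exposition format, histogram buckets are cumulative (each `le` bucket counts everything at or below its bound). A sketch of computing the cumulative bucket counts from days-since-last-workflow values (the input values are illustrative):

```python
import math

# Bucket upper bounds from the metric definitions above.
BUCKETS = [1, 3, 7, 14, 30, 90, 365, math.inf]


def histogram_buckets(days_since_last_workflow, buckets=BUCKETS):
    """Cumulative bucket counts, as used for Prometheus histogram `le` labels."""
    return {le: sum(1 for d in days_since_last_workflow if d <= le) for le in buckets}


counts = histogram_buckets([0.5, 2, 2, 10, 400])
```

Note that the final `<= ∞` bucket always equals the total number of observations.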
Future work
There are more metrics we could include, but they are expensive to calculate without either adding triggers and a summary table, or computing an initial value on startup and then maintaining running totals in e.g. Redis.
artifact_file_size
: Gauge of the total size of files in artifacts, in bytes. Labels: category (from Artifact.category), scope (from Artifact.workspace.scope.name).
SELECT
SUM(size),
category,
db_scope.name AS scope
FROM db_file
INNER JOIN db_fileinartifact ON (db_file.id = db_fileinartifact.file_id)
INNER JOIN db_artifact ON (db_fileinartifact.artifact_id = db_artifact.id)
INNER JOIN db_workspace ON (db_workspace.id = db_artifact.workspace_id)
INNER JOIN db_scope ON (db_scope.id = db_workspace.scope_id)
GROUP BY
category,
scope
;
...
(12 rows)
Time: 1632.224 ms (00:01.632)
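One alternative to running this query on every scrape is the running-totals approach mentioned above: compute the totals once, then adjust them as artifacts are created and deleted. A sketch with a plain dict standing in for Redis; the hook names and the category value are hypothetical:

```python
# Running totals keyed by (category, scope); in production this could live
# in Redis rather than process memory.
totals: dict = {}


def on_artifact_created(category, scope, file_size):
    """Hypothetical hook: called when an artifact's files are stored."""
    key = (category, scope)
    totals[key] = totals.get(key, 0) + file_size


def on_artifact_deleted(category, scope, file_size):
    """Hypothetical hook: called when an artifact expires or is removed."""
    key = (category, scope)
    totals[key] = totals.get(key, 0) - file_size


# Illustrative sequence of events:
on_artifact_created("debian:source-package", "debian", 1000)
on_artifact_created("debian:source-package", "debian", 500)
on_artifact_deleted("debian:source-package", "debian", 1000)
```

The trade-off is that any missed event leaves the totals permanently skewed, which is why a periodic reconciliation against the SQL query above would still be needed.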
Histograms
There are a few more metrics that could make useful histograms. For now we leave these as potential future improvements.
task_execution_times
: Extracted from our RuntimeStatistics model.
artifact_file_counts
: Number of files per artifact.
artifact_file_sizes
: File sizes for artifact files.
collection_sizes
: Number of items in collections.
tokens_last_seen
: now - last_seen_at for enabled tokens.