Dynamic cloud compute scaling
Requirements
To support spikes in work requests, debusine needs to be able to dynamically make use of CPU resources in clouds. Static workers may still be useful for various reasons, such as:
locality
confidentiality
supporting an expected base load
customers who are sensitive to making use of artifacts built in clouds they don’t control
support for exotic architectures
in some cases, the ability to launch virtual machines in the worker
However, dynamic workers offer more flexibility and may be cheaper overall.
Dynamic workers typically have a cost based at least in part on their uptime, so debusine must keep track of them and stay within resource limits configured by the administrator and/or scope owners.
debusine must be able to provision dynamic workers automatically, including any additional facilities such as setting up Incus; and it must be able to tear down idle dynamic workers based on appropriate criteria. Some provisioning decisions may be handled outside debusine proper, such as by providing pre-built images.
Dynamic workers will have various properties that can be considered when scheduling work requests, such as their size. (Worker metadata already handles requirements such as checking whether workers support particular executor backends.)
Static workers should have higher priority for assigning work requests, so that the use of dynamic workers can be limited to load spikes.
The UI’s list of workers needs to indicate whether workers are static or dynamic, and allow inspecting the profiles of dynamic workers.
Dynamic worker names can be reused within certain parameters (e.g. provider and architecture), as long as multiple active workers don’t share a name. This avoids the list of historical workers growing unreasonably long just due to dynamic worker instances being created and destroyed.
Initially, this feature only needs to support Amazon EC2, but it must be easy to extend it to add support for other cloud providers.
Expected changes
Assets
To support debusine:cloud-provider-account assets, Asset.workspace needs to become nullable, and only required for certain asset categories (currently debusine:signing-key). There will be some associated refactoring for this: for instance, Asset.can_create, Asset.__str__, AssetSerializer, the client code for creating assets, and Playground.create_asset currently assume that every asset has a workspace.
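As a rough sketch of what the model change might look like (the clean() validation hook and surrounding details are illustrative assumptions, not settled design):

    # Sketch only: workspace becomes nullable, but remains required for
    # specific asset categories; the category set and clean() hook are
    # illustrative assumptions.
    from django.core.exceptions import ValidationError
    from django.db import models

    WORKSPACE_REQUIRED_CATEGORIES = {"debusine:signing-key"}

    class Asset(models.Model):
        category = models.CharField(max_length=255)
        workspace = models.ForeignKey(
            "Workspace", null=True, blank=True, on_delete=models.PROTECT
        )

        def clean(self) -> None:
            if (
                self.category in WORKSPACE_REQUIRED_CATEGORIES
                and self.workspace is None
            ):
                raise ValidationError("This asset category requires a workspace.")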
Dynamic worker pools
debusine needs configuration for each cloud provider, such as API keys. This is conceptually similar to file stores, and it would make sense to handle it similarly. It also needs to keep track of pools of workers that are configured similarly, potentially including inactive ones. There may be multiple pools for a single cloud provider, using different accounts; for example, there may be pools on the same provider for different scopes with different billing arrangements.
Add a new WorkerPool model, including at least the following fields (all JSON objects are modelled using Pydantic):

name (string): the name of the pool
provider_account (foreign key to Asset): a debusine:cloud-provider-account asset with details of the provider account to use for this pool
enabled (boolean, defaults to True): if True, this pool is available for creating instances
architectures (array of strings): the task architectures supported by workers in this pool
tags (array of strings): the worker tags supported by workers in this pool (note that implementing worker pools does not require worker tags to be fully implemented yet)
specifications (JSON): public information indicating the type of instances to create, in a provider-dependent format; for some providers this may simply be instance type and image names, whereas for others it may include a collection of minimum values for parameters such as number of cores and RAM size
instance_wide (boolean, defaults to True): if True, this pool may be used by any scope on this debusine instance; if False, it may only be used by a single scope (i.e. there is a unique constraint on Scope/WorkerPool relations where WorkerPool.instance_wide is False)
ephemeral (boolean, defaults to False): if True, configure the worker to shut down and require reprovisioning after running a single work request
limits (JSON): instance limits, as follows:
  max_active_instances (integer, optional): the maximum number of active instances in this pool
  target_max_seconds_per_month (integer, optional): the maximum number of instance-seconds that should be used in this pool per month (this is a target maximum rather than a hard maximum, as debusine does not destroy instances that are running a task; it may be None if there is no need to impose such a limit)
  max_idle_seconds (integer, defaults to 3600): destroy instances that have been idle for this long
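A minimal sketch of this model, assuming Django with the limits JSON validated through Pydantic; the concrete field types and constraints are illustrative:

    import pydantic
    from django.db import models

    class WorkerPoolLimits(pydantic.BaseModel):
        """Pydantic model for WorkerPool.limits."""

        max_active_instances: int | None = None
        target_max_seconds_per_month: int | None = None
        max_idle_seconds: int = 3600

    class WorkerPool(models.Model):
        name = models.CharField(max_length=255, unique=True)
        provider_account = models.ForeignKey("Asset", on_delete=models.PROTECT)
        enabled = models.BooleanField(default=True)
        architectures = models.JSONField(default=list)
        tags = models.JSONField(default=list)
        specifications = models.JSONField(default=dict)
        instance_wide = models.BooleanField(default=True)
        ephemeral = models.BooleanField(default=False)
        # Stored as JSON; validated as WorkerPoolLimits before use.
        limits = models.JSONField(default=dict)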
Note
Public cloud providers typically have billing cycles corresponding to calendar months, which are not all the same length. debusine keeps track of instance run-times for each month in the Gregorian calendar in an attempt to approximate this. It will not always produce an exactly accurate prediction of run-time for the purpose of provider charges, since providers vary in terms of how they account for things like instances that run for less than an hour.
It is the administrator’s responsibility to calculate appropriate limits based on the provider’s advertised pricing for the configured instance type.
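For illustration, the contribution of a still-running instance to the current month can be computed by clamping its runtime to the start of the calendar month (a sketch; completed runs earlier in the month would be summed separately):

    from datetime import datetime

    def seconds_this_month(instance_created_at: datetime, now: datetime) -> float:
        """Seconds accrued by a still-running instance in the current month."""
        month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        return (now - max(instance_created_at, month_start)).total_seconds()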
Note
To avoid wasting resources, debusine does not destroy instances that are actively running a task; this may cause it to overrun target_max_seconds_per_month. As a result, to keep resource usage under control even if some tasks take a very long time, administrators should normally also set max_active_instances, and should independently set up billing alerts with their cloud providers.
Note
max_idle_seconds has a conservative default to minimize accidental billing. Administrators should tune it in production taking the observed provisioning time into account, so that an acceptable fraction of instance run-time is spent actually running tasks.
Note
Although it isn’t initially required, there may be a “static” provider corresponding to manually-provisioned workers, allowing them to have the same kinds of flexible prioritization.
Worker names are constructed as f"{pool.name}-{instance.number:03d}", where instance numbers are allocated sequentially within the pool, and the lowest available instance number is used when creating a new instance.
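A sketch of this allocation, assuming the caller can collect the instance numbers currently in use from existing worker names (starting the numbering at 1 is an assumption):

    def next_worker_name(pool_name: str, used_numbers: set[int]) -> str:
        """Build a worker name using the lowest free instance number."""
        number = 1  # assumed starting point
        while number in used_numbers:
            number += 1
        return f"{pool_name}-{number:03d}"

    # next_worker_name("ec2-amd64", {1, 2, 4}) -> "ec2-amd64-003"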
Add an optional Worker.worker_pool foreign key.
Scope-level controls
Add a many-to-many Scope.worker_pools relationship pointing to WorkerPool, with extra data on the relationship as follows:

priority (integer): the priority of this worker pool for the purpose of scheduling work requests and creating dynamic workers to handle load spikes; pools with a higher priority will be selected in preference to pools with a lower priority. Workers that do not have a pool implicitly have a higher priority than any workers that have a pool.
limits (JSON): scope-level limits, as follows:
  target_max_seconds_per_month (integer, optional): the maximum number of instance-seconds that should be used by work requests in this scope per month (this is a target maximum rather than a hard maximum, as debusine does not destroy instances that are running a task; it may be None if there is no need to impose such a limit; note that idle worker time is not accounted to any scope)
  target_latency_seconds (integer, optional): a target for the number of seconds before the last pending work request in the relevant subset of the work queue is dispatched to a worker, which the provisioning service uses as a best-effort hint when scaling dynamic workers
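A sketch of the corresponding through model (named ScopeWorkerPool, matching references later in this document; field details are illustrative):

    import pydantic
    from django.db import models

    class ScopeWorkerPoolLimits(pydantic.BaseModel):
        """Pydantic model for ScopeWorkerPool.limits."""

        target_max_seconds_per_month: int | None = None
        target_latency_seconds: int | None = None

    class ScopeWorkerPool(models.Model):
        scope = models.ForeignKey("Scope", on_delete=models.CASCADE)
        worker_pool = models.ForeignKey("WorkerPool", on_delete=models.CASCADE)
        priority = models.IntegerField()
        # Stored as JSON; validated as ScopeWorkerPoolLimits before use.
        limits = models.JSONField(default=dict)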
Worker-level accounting
Add a Worker.instance_created_at field, which is set for dynamic workers when their instance is created. This is similar to Worker.registered_at, but that field indicates when the Worker row was created and remains constant across multiple create/destroy cycles, while Worker.instance_created_at can be used to determine the current runtime of an instance by subtracting it from the current time.
While RuntimeStatistics contains the runtime duration of each task and thus allows calculating how many seconds have been spent on behalf of each scope by each worker, doing so on the fly would involve a complex database query. To avoid that, add a Worker.durations_by_scope many-to-many field, with the accumulated duration for that worker and scope as extra data on the relationship. When the server is notified of a completed work request, it adds the work request's duration to that field. When a dynamic worker is (re-)created, it sets that field to zero.
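A sketch of the accounting update on completion, assuming a hypothetical WorkerScopeDuration through model backing the many-to-many field:

    from django.db import transaction
    from django.db.models import F

    def record_completed_work_request(worker, scope, duration_seconds: int) -> None:
        """Add a finished work request's duration to the worker/scope total."""
        # WorkerScopeDuration is a hypothetical through model for
        # Worker.durations_by_scope.
        with transaction.atomic():
            row, _ = WorkerScopeDuration.objects.get_or_create(
                worker=worker, scope=scope, defaults={"duration_seconds": 0}
            )
            row.duration_seconds = F("duration_seconds") + duration_seconds
            row.save(update_fields=["duration_seconds"])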
Workers do not currently notify the server when a task is aborted. They will need to start doing so, at least in order to send runtime statistics.
Image building
debusine workers on cloud providers need a base image. While this could be a generic image plus some dynamic provisioning code, it’s faster and more flexible to have pre-built images that already contain the worker code and only need to be given a token and a debusine server API URL.
The process of building and publishing these images should eventually be a debusine task, but to start with it can be an ad-hoc script. However, the code should still be in the debusine repository so that we can develop it along with the rest of the code.
Image builds will need at least the following options (which might be command-line options rather than this JSON-style design):
source (string, optional): a deb822-style APT source to add; for example, this would allow using the latest debusine-worker development build rather than the version of debusine-worker in the base image's default repositories
enable_backends (list of strings, defaults to ["unshare"]): install and configure packages needed for the given executors
enable_tasks (list of strings, defaults to ["autopkgtest", "sbuild"]): install packages needed for the given tasks; most tasks do not need explicit support during provisioning, but autopkgtest, mmdebstrap, sbuild, and simplesystemimagebuild do
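For illustration, these options could be expressed as a Pydantic model with the defaults above (the final interface may instead be command-line options):

    import pydantic

    class ImageBuildOptions(pydantic.BaseModel):
        """Illustrative options for building a worker image."""

        # deb822-style APT source to add, if any.
        source: str | None = None
        enable_backends: list[str] = ["unshare"]
        enable_tasks: list[str] = ["autopkgtest", "sbuild"]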
The initial image building code can be derived from Freexian's current Ansible setup for the debusine_worker role.
Provisioning
For each provider, debusine must have a backend that knows how to provision a new instance based on WorkerPool.specifications.
Provisioning must be non-interactive, so the provisioning code must provide enabled tokens to workers. It must be careful that tasks running on the worker cannot access the token (e.g. via cloud metadata endpoints) once the instance is up.
A new Celery service controls the provisioning process. (While this is somewhat related to the scheduler, it has very different performance characteristics - even in the best case, provisioning is typically much slower than scheduling work requests - and so it’s better to keep it separate.) That service periodically monitors the number of pending work requests per scope and decides whether to create new dynamic workers or destroy idle dynamic workers to meet demand. When doing so, it only considers the subset of pending work requests that would require the dynamic workers in question, taking into account worker tags and any restrictions declared by work requests: for example, if there are idle dynamic workers with a given tag and no pending work requests that require that tag, those workers can be destroyed.
The provisioning service must not create workers in a pool if WorkerPool.enabled is False, or if doing so would take it over any of the limits specified in WorkerPool.limits (considering Worker.instance_created_at) or ScopeWorkerPool.limits (considering Worker.durations_by_scope). It must destroy workers if they have been idle for longer than WorkerPool.limits.max_idle_seconds, or if they are idle and exceed any of the other limits in WorkerPool.limits.
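A sketch of these checks under the limit definitions above (the counting and accounting helpers are assumed to exist elsewhere):

    def may_create_instance(pool) -> bool:
        """Pool-level checks before provisioning a new instance."""
        limits = WorkerPoolLimits(**pool.limits)
        if not pool.enabled:
            return False
        # active_instance_count and seconds_used_this_month are assumed helpers.
        if (limits.max_active_instances is not None
                and active_instance_count(pool) >= limits.max_active_instances):
            return False
        if (limits.target_max_seconds_per_month is not None
                and seconds_used_this_month(pool)
                >= limits.target_max_seconds_per_month):
            return False
        return True

    def should_destroy(worker, idle_seconds: float) -> bool:
        """Only idle workers are destroyed: idle too long, or over limits."""
        # worker_is_idle and over_other_limits are assumed helpers.
        if not worker_is_idle(worker):
            return False
        limits = WorkerPoolLimits(**worker.worker_pool.limits)
        return idle_seconds > limits.max_idle_seconds or over_other_limits(worker)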
ScopeWorkerPool.limits.target_latency_seconds acts as a best-effort hint for how aggressively to scale up workers. The provisioning service should aim to scale up the number of workers until its estimate of the time before the last pending work request is dispatched to a worker reaches that target for each scope where it is set, while not exceeding other limits. debusine does not guarantee to meet or even necessarily approach this target, but it allows administrators to tune how hard it should try. Task statistics will be required for good time estimates, but a first draft can use rough estimates such as the observed mean of work request durations regardless of subject or context.
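As a first-draft estimate of the scale-up target, a sketch of the rough approach just described:

    import math

    def workers_needed(pending: int, mean_duration: float,
                       target_latency: float) -> int:
        """Estimate workers needed to dispatch the last pending work request
        within target_latency: with n workers, the queue drains in roughly
        pending * mean_duration / n seconds.
        """
        if pending == 0:
            return 0
        return math.ceil(pending * mean_duration / target_latency)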
The highest-priority pool may be unavailable for various reasons: there might be an outage, or the highest-priority pool might be a discounted provider option with low availability guarantees, such as spot instances. The provisioning service should fall back to lower-priority pools as needed to satisfy the constraints above. If lower-priority pools are more expensive, then administrators can assign them a higher target_latency_seconds value so that debusine will not scale up workers as aggressively in those pools.
Todo
Because workers are only destroyed when idle, the provisioning service is in practice only able to scale down worker pools once the relevant part of the work queue has been exhausted. This means that the provisioning service should normally avoid being too aggressive when creating new dynamic workers.
Note
Some of this functionality overlaps with auto-scaling features that already exist in some cloud providers, and in principle it would be possible for debusine to provide metrics to those providers that would allow creating auto-scaling policies. However, we handle scaling ourselves instead because this allows us to use providers that don’t have that feature, and in order to allow scaling across multiple providers.
Scheduler
The scheduler currently checks worker metadata on a per-worker basis to decide whether a worker can run a given work request. This was already a potential optimization target, but it will need to be optimized to support deciding whether to provision dynamic workers, since the decision will need to be made in bulk for many workers at once. Instead of the current can_run_on hook that runs for a single work request, tasks will need some way to provide Django query conditions selecting the workers that can run them, relying on worker tags.
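For example, a task might contribute a Django Q object that the scheduler combines and applies in bulk (a sketch; the tag storage and field names are assumptions):

    from django.db.models import Q

    def worker_conditions(task) -> Q:
        """Query conditions selecting workers able to run this task."""
        # task.required_worker_tags and the worker_pool field paths are
        # assumed names for illustration.
        condition = Q(worker_pool__architectures__contains=[task.architecture])
        for tag in task.required_worker_tags:
            condition &= Q(worker_pool__tags__contains=[tag])
        return condition

    # eligible_workers = Worker.objects.filter(worker_conditions(task))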
Once we have task statistics, we are likely to want to select workers (or worker pools) that have a certain minimum amount of disk space or memory. It may be sufficient to have a small/medium/big classification, but a clearer approach would be for some tags to have a numerical value so that work requests can indicate the minimum value they need. These would be a natural fit for worker pools: WorkerPool.specifications will usually already specify instance sizes in some way, and therefore WorkerPool.tags can also communicate those sizes to the scheduler.
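Purely as illustration of the numeric-tag idea (neither the storage format nor the lookup is settled):

    # Hypothetical: numeric tags stored as a mapping, e.g.
    # {"memory_mib": 16384, "cores": 8}.
    def meets_minimums(pool_tags: dict, minimums: dict) -> bool:
        """Check a pool's numeric tags against a work request's minimums."""
        return all(
            pool_tags.get(name, 0) >= value for name, value in minimums.items()
        )

    # meets_minimums({"memory_mib": 16384}, {"memory_mib": 8192}) -> True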
User interface
Add an indication to /-/status/workers/ showing each worker's pool.
Make each worker on /-/status/workers/ a link to a view for that single worker. Where available, that view includes information from the corresponding WorkerPool, excluding secret details of the provider account.
Exclude inactive dynamic workers from /-/status/workers/, to avoid flooding users with irrelevant information.