Dynamic cloud compute scaling

Requirements

To support spikes in work requests, debusine needs to be able to dynamically make use of CPU resources in clouds. Static workers may still be useful for various reasons, such as:

  • locality

  • confidentiality

  • supporting an expected base load

  • customers who are sensitive to making use of artifacts built in clouds they don’t control

  • support for exotic architectures

  • in some cases, the ability to launch virtual machines in the worker

However, dynamic workers offer more flexibility and may be cheaper overall.

Dynamic workers typically have a cost based at least in part on their uptime, so debusine must keep track of them and stay within resource limits configured by the administrator and/or scope owners.

debusine must be able to provision dynamic workers automatically, including any additional facilities such as setting up Incus; and it must be able to tear down idle dynamic workers based on appropriate criteria. Some provisioning decisions may be handled outside debusine proper, such as by providing pre-built images.

Dynamic workers will have various properties that can be considered when scheduling work requests, such as their size. (Worker metadata already handles requirements such as checking whether workers support particular executor backends.)

Static workers should have higher priority for assigning work requests, so that the use of dynamic workers can be limited to load spikes.

The UI’s list of workers needs to indicate whether workers are static or dynamic, and allow inspecting the profiles of dynamic workers.

Dynamic worker names can be reused within certain parameters (e.g. provider and architecture), as long as multiple active workers don’t share a name. This avoids the list of historical workers growing unreasonably long just due to dynamic worker instances being created and destroyed.

Initially, this feature only needs to support Amazon EC2, but it must be easy to extend it to add support for other cloud providers.

Expected changes

Assets

To support debusine:cloud-provider-account assets, Asset.workspace needs to become nullable, and only required for certain asset categories (currently debusine:signing-key). There will be some associated refactoring for this: for instance, Asset.can_create, Asset.__str__, AssetSerializer, the client code for creating assets, and Playground.create_asset currently assume that every asset has a workspace.
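
As a rough illustration only (not the actual implementation), the category-dependent workspace requirement could be enforced with model validation along these lines; the Workspace reference, field options, and the category constant are assumptions:

    # Sketch: make Asset.workspace nullable and require it only for some categories.
    from django.core.exceptions import ValidationError
    from django.db import models

    CATEGORIES_REQUIRING_WORKSPACE = {"debusine:signing-key"}

    class Asset(models.Model):
        category = models.CharField(max_length=255)
        workspace = models.ForeignKey(
            "Workspace", null=True, blank=True, on_delete=models.PROTECT
        )

        def clean(self) -> None:
            # Cloud provider account assets are instance-wide and have no
            # workspace; only some categories still require one.
            if (
                self.category in CATEGORIES_REQUIRING_WORKSPACE
                and self.workspace is None
            ):
                raise ValidationError(
                    f"Assets in category {self.category} require a workspace"
                )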

Dynamic worker pools

debusine needs configuration for each cloud provider, such as API keys. This is conceptually similar to file stores, and it would make sense to handle it similarly. It also needs to keep track of pools of workers that are configured similarly, potentially including inactive ones. There may be multiple pools for a single cloud provider, using different accounts; for example, there may be pools on the same provider for different scopes with different billing arrangements.

Add a new WorkerPool model, including at least the following fields (all JSON objects are modelled using Pydantic):

  • name (string): the name of the pool

  • provider_account (foreign key to Asset): a debusine:cloud-provider-account asset with details of the provider account to use for this pool

  • enabled (boolean, defaults to True): if True, this pool is available for creating instances

  • architectures (array of strings): the task architectures supported by workers in this pool

  • tags (array of strings): the worker tags supported by workers in this pool (note that implementing worker pools does not require worker tags to be fully implemented yet)

  • specifications (JSON): public information indicating the type of instances to create, in a provider-dependent format; for some providers this may simply be instance type and image names, whereas for others it may include a collection of minimum values for parameters such as number of cores and RAM size

  • instance_wide (boolean, defaults to True): if True, this pool may be used by any scope on this debusine instance; if False, it may only be used by a single scope (i.e. there is a unique constraint on Scope/WorkerPool relations where WorkerPool.instance_wide is False)

  • ephemeral (boolean, defaults to False): if True, configure the worker to shut down and require reprovisioning after running a single work request

  • limits (JSON): instance limits, as follows:

    • max_active_instances (integer, optional): the maximum number of active instances in this pool

    • target_max_seconds_per_month (integer, optional): the maximum number of instance-seconds that should be used in this pool per month (this is a target maximum rather than a hard maximum, as debusine does not destroy instances that are running a task; it may be None if there is no need to impose such a limit)

    • max_idle_seconds (integer, defaults to 3600): destroy instances that have been idle for this long

Note

Public cloud providers typically have billing cycles corresponding to calendar months, which are not all the same length. debusine keeps track of instance run-times for each Gregorian calendar month in an attempt to approximate this. The result will not always match the provider’s charges exactly, since providers vary in how they account for things like instances that run for less than an hour.

It is the administrator’s responsibility to calculate appropriate limits based on the provider’s advertised pricing for the configured instance type.

Note

To avoid wasting resources, debusine does not destroy instances that are actively running a task; this may cause it to overrun target_max_seconds_per_month. As a result, to keep resource usage under control even if some tasks take a very long time, administrators should normally also set max_active_instances, and should independently set up billing alerts with their cloud providers.

Note

max_idle_seconds has a conservative default to minimize accidental billing. Administrators should tune it in production taking the observed provisioning time into account, so that an acceptable fraction of instance run-time is spent actually running tasks.

Note

Although it isn’t initially required, there may be a “static” provider corresponding to manually-provisioned workers, allowing them to have the same kinds of flexible prioritization.
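
To make the shape of this model concrete, here is a minimal Django-style sketch of WorkerPool following the field list above; the exact field options and the Pydantic integration are assumptions, not the final implementation:

    # Sketch only: field names follow the list above, details are assumptions.
    import pydantic
    from django.db import models

    class WorkerPoolLimits(pydantic.BaseModel):
        max_active_instances: int | None = None
        target_max_seconds_per_month: int | None = None
        max_idle_seconds: int = 3600

    class WorkerPool(models.Model):
        name = models.CharField(max_length=255)
        # A debusine:cloud-provider-account asset.
        provider_account = models.ForeignKey("Asset", on_delete=models.PROTECT)
        enabled = models.BooleanField(default=True)
        architectures = models.JSONField(default=list)  # e.g. ["amd64", "arm64"]
        tags = models.JSONField(default=list)
        specifications = models.JSONField(default=dict)  # provider-dependent
        instance_wide = models.BooleanField(default=True)
        ephemeral = models.BooleanField(default=False)
        limits = models.JSONField(default=dict)

        @property
        def limits_model(self) -> WorkerPoolLimits:
            # Parse the stored JSON so that limits are validated and defaulted.
            return WorkerPoolLimits(**self.limits)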

Worker names are constructed as f"{pool.name}-{instance.number:03d}", where instance numbers are allocated sequentially within the pool, and the lowest available instance number is used when creating a new instance.
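
For illustration, a sketch of that allocation scheme (the helper name is hypothetical):

    # Hypothetical helper: allocate the lowest free instance number in a pool.
    def next_worker_name(pool_name: str, used_numbers: set[int]) -> str:
        number = 1
        while number in used_numbers:
            number += 1
        return f"{pool_name}-{number:03d}"

    # With instances 1 and 3 still active, the next worker reuses number 2:
    # next_worker_name("ec2-amd64", {1, 3}) == "ec2-amd64-002"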

Add an optional Worker.worker_pool foreign key.

Scope-level controls

Add a many-to-many Scope.worker_pools relationship pointing to WorkerPool, with extra data on the relationship as follows:

  • priority (integer): The priority of this worker pool for the purpose of scheduling work requests and creating dynamic workers to handle load spikes; pools with a higher priority will be selected in preference to pools with a lower priority. Workers that do not have a pool implicitly have a higher priority than any workers that have a pool.

  • limits (JSON): scope-level limits, as follows:

    • target_max_seconds_per_month (integer, optional): the maximum number of instance-seconds that should be used by work requests in this scope per month (this is a target maximum rather than a hard maximum, as debusine does not destroy instances that are running a task; it may be None if there is no need to impose such a limit; note that idle worker time is not accounted to any scope)

    • target_latency_seconds (integer, optional): a target for the number of seconds before the last pending work request in the relevant subset of the work queue is dispatched to a worker, which the provisioning service uses as a best-effort hint when scaling dynamic workers
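
A minimal sketch of the through model, which the rest of this document refers to as ScopeWorkerPool; the field details beyond priority and limits are assumptions:

    # Sketch of the extra data on the Scope/WorkerPool relationship.
    import pydantic
    from django.db import models

    class ScopeWorkerPoolLimits(pydantic.BaseModel):
        target_max_seconds_per_month: int | None = None
        target_latency_seconds: int | None = None

    class ScopeWorkerPool(models.Model):
        scope = models.ForeignKey("Scope", on_delete=models.CASCADE)
        worker_pool = models.ForeignKey("WorkerPool", on_delete=models.CASCADE)
        priority = models.IntegerField()
        limits = models.JSONField(default=dict)  # validated as ScopeWorkerPoolLimits

    # On Scope:
    #   worker_pools = models.ManyToManyField("WorkerPool", through=ScopeWorkerPool)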

Worker-level accounting

Add a Worker.instance_created_at field, which is set for dynamic workers when their instance is created. This is similar to Worker.registered_at, but that field indicates when the Worker row was created and remains constant across multiple create/destroy cycles, while Worker.instance_created_at can be used to determine the current runtime of an instance by subtracting it from the current time.

While RuntimeStatistics contains the runtime duration of each task and thus allows calculating how many seconds each worker has spent on behalf of each scope, doing so on the fly would involve a complex database query. To avoid that, add a Worker.durations_by_scope many-to-many field, with the accumulated duration for that worker and scope as extra data on the relationship. When the server is notified of a completed work request, it adds the work request’s duration to that field. When a dynamic worker is (re-)created, the accumulated durations are reset to zero.
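
As a sketch of the accounting update, assuming a hypothetical WorkerScopeDuration through model behind Worker.durations_by_scope:

    from django.db import models
    from django.db.models import F

    class WorkerScopeDuration(models.Model):
        # Hypothetical through model behind Worker.durations_by_scope.
        worker = models.ForeignKey("Worker", on_delete=models.CASCADE)
        scope = models.ForeignKey("Scope", on_delete=models.CASCADE)
        seconds = models.BigIntegerField(default=0)

    def record_completed_work_request(worker, scope, duration_seconds: int) -> None:
        # Accumulate the completed work request's duration for this worker/scope.
        row, _ = WorkerScopeDuration.objects.get_or_create(worker=worker, scope=scope)
        row.seconds = F("seconds") + duration_seconds  # atomic increment
        row.save(update_fields=["seconds"])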

Workers do not currently notify the server when a task is aborted. They will need to start doing so, at least in order to send runtime statistics.

Image building

debusine workers on cloud providers need a base image. While this could be a generic image plus some dynamic provisioning code, it’s faster and more flexible to have pre-built images that already contain the worker code and only need to be given a token and a debusine server API URL.

The process of building and publishing these images should eventually be a debusine task, but to start with it can be an ad-hoc script. However, the code should still be in the debusine repository so that we can develop it along with the rest of the code.

Image builds will need at least the following options (which might be command-line options rather than this JSON-style design):

  • source (string, optional): a deb822-style APT source to add; for example, this would allow using the latest debusine-worker development build rather than the version of debusine-worker in the base image’s default repositories

  • enable_backends (list of strings, defaults to ["unshare"]): install and configure packages needed for the given executors

  • enable_tasks (list of strings, defaults to ["autopkgtest", "sbuild"]): install packages needed for the given tasks; most tasks do not need explicit support during provisioning, but autopkgtest, mmdebstrap, sbuild, and simplesystemimagebuild do
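
For illustration, the options could be modelled roughly as follows; this is a sketch only, and the class name and the extra backend in the example are assumptions:

    # Hypothetical model of the image build options described above.
    import pydantic

    class ImageBuildOptions(pydantic.BaseModel):
        source: str | None = None  # deb822-style APT source to add
        enable_backends: list[str] = ["unshare"]
        enable_tasks: list[str] = ["autopkgtest", "sbuild"]

    # Example: also install what is needed for an Incus-based executor
    # (backend names here are illustrative).
    options = ImageBuildOptions(enable_backends=["unshare", "incus-lxc"])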

The initial image building code can be derived from Freexian’s current Ansible setup for the debusine_worker role.

Provisioning

For each provider, debusine must have a backend that knows how to provision a new instance based on WorkerPool.specifications.

Provisioning must be non-interactive, so the provisioning code must provide enabled tokens to workers. It must be careful that tasks running on the worker cannot access the token (e.g. via cloud metadata endpoints) once the instance is up.
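
A possible shape for such a backend, sketched as a small abstract interface; the class and method names are assumptions, not an existing debusine API:

    from abc import ABC, abstractmethod
    from typing import Any

    class CloudProvider(ABC):
        def __init__(self, provider_account: dict[str, Any]) -> None:
            # Credentials from the debusine:cloud-provider-account asset.
            self.provider_account = provider_account

        @abstractmethod
        def launch_instance(
            self, specifications: dict[str, Any], token: str, api_url: str
        ) -> str:
            """Create an instance and return a provider-specific instance ID.

            The token must only be available during first boot, not to tasks
            running later (e.g. via cloud metadata endpoints).
            """

        @abstractmethod
        def terminate_instance(self, instance_id: str) -> None:
            """Destroy a previously launched instance."""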

A new Celery service controls the provisioning process. (While this is somewhat related to the scheduler, it has very different performance characteristics - even in the best case, provisioning is typically much slower than scheduling work requests - and so it’s better to keep it separate.) That service periodically monitors the number of pending work requests per scope and decides whether to create new dynamic workers or destroy idle dynamic workers to meet demand. When doing so, it only considers the subset of pending work requests that would require the dynamic workers in question, taking into account worker tags and any restrictions declared by work requests: for example, if there are idle dynamic workers with a given tag and no pending work requests that require that tag, those workers can be destroyed.

The provisioning service must not create workers in a pool if WorkerPool.enabled is False, or if doing so would take it over any of the limits specified in WorkerPool.limits (considering Worker.instance_created_at) or ScopeWorkerPool.limits (considering Worker.durations_by_scope). It must destroy workers if they have been idle for longer than WorkerPool.limits.max_idle_seconds, or if they are idle and exceed any of the other limits in WorkerPool.limits.

ScopeWorkerPool.limits.target_latency_seconds acts as a best-effort hint for how aggressively to scale up workers. The provisioning service should aim to scale up the number of workers until its estimate of the time before the last pending work request is dispatched to a worker reaches that target for each scope where it is set, while not exceeding other limits. debusine does not guarantee to meet or even necessarily approach this target, but it allows administrators to tune how hard it should try. Task statistics will be required for good time estimates, but a first draft can use rough estimates such as the observed mean of work request durations regardless of subject or context.
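
As a first-draft illustration of that estimate (all names here are hypothetical):

    # Rough estimate of dispatch latency for a subset of the work queue.
    def estimated_dispatch_latency_seconds(
        pending_work_requests: int,
        active_workers: int,
        mean_task_duration_seconds: float,
    ) -> float:
        if active_workers == 0:
            return float("inf")
        # With N workers draining the queue in parallel, the last pending work
        # request waits roughly (queue length / N) mean task durations.
        return pending_work_requests * mean_task_duration_seconds / active_workers

    # Scale up while this estimate exceeds target_latency_seconds and no
    # WorkerPool or ScopeWorkerPool limit would be exceeded.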

The highest-priority pool may be unavailable for various reasons: there might be an outage, or the highest-priority pool might be a discounted provider option with low availability guarantees such as spot instances. The provisioning service should fall back to lower-priority pools as needed to satisfy the constraints above. If lower-priority pools are more expensive, then administrators can assign them a lower target_latency_seconds value so that debusine will not scale up workers as aggressively in those pools.

Todo

Because workers are only destroyed when idle, the provisioning service is in practice only able to scale down worker pools once the relevant part of the work queue has been exhausted. This means that the provisioning service should normally avoid being too aggressive when creating new dynamic workers.

Note

Some of this functionality overlaps with auto-scaling features that already exist in some cloud providers, and in principle it would be possible for debusine to provide metrics to those providers that would allow creating auto-scaling policies. However, we handle scaling ourselves instead because this allows us to use providers that don’t have that feature, and in order to allow scaling across multiple providers.

Scheduler

The scheduler currently checks worker metadata on a per-worker basis to decide whether a worker can run a given work request. This was already a potential optimization target, and supporting decisions about whether to provision dynamic workers will make optimizing it necessary, since those decisions must be made in bulk for many workers at once. Instead of the current can_run_on hook, which runs for a single work request, tasks will need some way to provide Django query conditions selecting the workers that can run them, relying on worker tags.
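
A rough sketch of what such a hook could look like, assuming worker tags and architectures are stored in queryable JSON fields; the class, method, and tag names are illustrative only:

    from django.db.models import Q

    class ExampleTask:  # stand-in for a task class such as Sbuild
        def worker_filter(self) -> Q:
            # Select workers whose pool advertises the required architecture
            # and tag (tag format is an assumption).
            return Q(worker_pool__architectures__contains=["amd64"]) & Q(
                worker_pool__tags__contains=["executor:unshare"]
            )

    # The scheduler and provisioning service can then evaluate
    # Worker.objects.filter(task.worker_filter()) once for many workers.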

Once we have task statistics, we are likely to want to select workers (or worker pools) that have a certain minimum amount of disk space or memory. It may be sufficient to have a small/medium/big classification, but a clearer approach would be for some tags to have a numerical value so that work requests can indicate the minimum value they need. These would be a natural fit for worker pools: WorkerPool.specifications will usually already specify instance sizes in some way, and therefore WorkerPool.tags can also communicate those sizes to the scheduler.

User interface

Add an indication to /-/status/workers/ showing each worker’s pool.

Make each worker on /-/status/workers/ a link to a view for that single worker. Where available, that view includes information from the corresponding WorkerPool, excluding secret details of the provider account.

Exclude inactive dynamic workers from /-/status/workers/, to avoid flooding users with irrelevant information.