Detecting regressions
Goals and requirements
When preparing updates, it is a best practice to ensure that the update does not introduce any regression in any of the QA checks. The concept of a regression is thus managed at the level of the QA workflow.
Some regressions can be identified from the result of the respective QA tasks. An autopkgtest that used to pass and now fails is an obvious example. However, when both runs fail, you might want to dig deeper and check whether the same set of tests is failing: if new tests are failing, we still have a regression. Conversely, a lintian analysis might succeed on both versions but generate new warning-level tags, which we might want to report as a regression.
Other considerations:

- We want the QA workflow to have a summary view showing side-by-side the results of the QA tasks on the original and updated package(s), along with a conclusion on whether there is a regression or not. It should be possible to have a detailed view of a specific comparison, showing a comparison of the artifacts generated by the QA tasks.
- We want that table to be regularly updated, every time a QA task finishes, without waiting for the completion of all QA tasks.
- We want to be able to configure the debian_pipeline workflow so that any failure or any regression in the qa workflow requires a manual confirmation to continue the root workflow. (Right now the result of the qa workflow has no impact on the package_upload workflow, for example.)
About the debian:qa-results collection category
This collection can host all the QA results for all the packages in a given suite. We thus have a different kind of item for each QA task.
Common features across all QA results
The name of the collection item is always {task_name}_{package}_{version}_{architecture}_{work_request_id}, substituting values from the per-item data described below.

Variables when adding items: see "Per-item data" below.
Data:

- suite_collection (Single lookup, required): the debian:suite collection for which the collection is storing the QA results.
- old_items_to_keep: number of old items to keep. Defaults to 5. For each task/package/architecture combination, the collection always keeps the given number of most recent entries (based on the timestamp field). The cleanup is automatically done when adding new items.

Note

At some point, we may need more advanced logic than this, for instance to clean up QA results about packages that are gone from the corresponding suite.
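A minimal sketch of the old_items_to_keep retention logic described above, treating collection items as plain dictionaries carrying the per-item data; the helper name is illustrative, not an existing API:

from collections import defaultdict

def items_to_clean_up(items, old_items_to_keep=5):
    # Group items by task/package/architecture combination.
    groups = defaultdict(list)
    for item in items:
        key = (item["task_name"], item["package"], item["architecture"])
        groups[key].append(item)
    # Keep the most recent entries (based on the timestamp field) and
    # return the rest as candidates for removal.
    removable = []
    for group in groups.values():
        group.sort(key=lambda i: i["timestamp"], reverse=True)
        removable.extend(group[old_items_to_keep:])
    return removable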
Per-item data:

- task_name (inferred from work request when adding item): the name of the task that generated the QA result (so autopkgtest, lintian, piuparts, in the future also blhc)
- package: the name of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- version: the version of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- architecture: source for a source analysis, or the appropriate architecture name for a binary analysis
- timestamp: a Unix timestamp (cf. date +%s) used to estimate the freshness of the QA result; it can often usefully be tied to the Date field of the package repository in which the analysis was performed
- work_request_id: the ID of the WorkRequest that generated the QA result
- result (inferred from work request when adding item): duplicates the string value of the result field of the associated WorkRequest
Lookup names:

- latest:TASKNAME_PACKAGE_ARCHITECTURE: the latest QA result for the task named TASKNAME for the source package named PACKAGE on ARCHITECTURE.
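For illustration, a sketch of how an item name and its lookup name derive from the per-item data (the helper and the example values below are made up):

def qa_result_names(data):
    # data is a dictionary carrying the per-item fields described above.
    item = "{task_name}_{package}_{version}_{architecture}_{work_request_id}".format(**data)
    lookup = "latest:{task_name}_{package}_{architecture}".format(**data)
    return item, lookup

# Example: a lintian analysis of a hypothetical dpkg upload on amd64
qa_result_names({
    "task_name": "lintian",
    "package": "dpkg",
    "version": "1.22.6-1",
    "architecture": "amd64",
    "work_request_id": 1234,
})
# -> ("lintian_dpkg_1.22.6-1_amd64_1234", "latest:lintian_dpkg_amd64")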
Specification for lintian
The collection item is a debian:lintian artifact. The collection contains items for the source and for all architectures.
A lintian analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what's available in the debian:suite collection
- or the lintian version used to perform the analysis is older than the version available in the debian:suite collection
Specification for autopkgtest
The collection item is a debian:autopkgtest artifact. The collection contains items for all architectures (but not for the source).
An autopkgtest analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what's available in the debian:suite collection
- or the timestamp of the analysis is older than 30 days compared to the Date timestamp of the debian:suite collection
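For the timestamp-based criterion, a minimal sketch (assuming the suite's Date field is available as a Unix timestamp):

THIRTY_DAYS = 30 * 24 * 3600

def autopkgtest_timestamp_outdated(analysis_timestamp, suite_date_timestamp):
    # Both arguments are Unix timestamps (cf. date +%s).
    return suite_date_timestamp - analysis_timestamp > THIRTY_DAYS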
Specification for piuparts
The collection item is a bare data item of category debian:qa-result
with all the common per-item data described above. The collection contains
items for all architectures (but not for the source).
A piuparts analysis is outdated if the underlying binary packages are outdated (i.e. have different version numbers) compared to what’s available in the debian:suite collection.
Note
The lack of a piuparts artifact means that we don't have any information about the binary packages that were analyzed, except by looking up the details of the WorkRequest. That's probably going too far, so instead we will likely compare based on the source version documented in the per-item data.
Filed #805 to think about introducing a proper artifact at some point.
Specification for blhc
The collection item is a debian:blhc artifact. The collection contains items for all architectures (but not for the source).
A blhc analysis is outdated if the underlying source package is outdated (i.e. has a smaller version number) compared to what’s available in the debian:suite collection. The comparison needs to be performed based on the metadata of the linked debian:package-build-log artifact.
Specification for debdiff
The DebDiff QA task does not contribute any item to the
debian:qa-results
because it does not provide any validation
of a single target artifact.
By its nature, the task already performs a comparison between the original version and the new version, and the result of that comparison cannot easily be used to draw any conclusion about the update; it is up to human reviewers to decide.
Implementation plan
Changes to the debian_pipeline workflow
The workflow is expanded with new parameters:

- enable_regression_tracking (boolean, defaults to False): configure the QA workflow to detect and display regressions in QA results.
- regression_tracking_qa_results (Single lookup, required if enable_regression_tracking is True): the debian:qa-results collection that contains the reference results of QA tasks to use to detect regressions.
- qa_failure_policy (string, default value ignore): the policy to apply when the qa workflow failed. Allowed values are ignore, fail, confirm.
The parameter reverse_dependencies_autopkgtest_suite is renamed to qa_suite and becomes required if either enable_regression_tracking or enable_reverse_dependencies_autopkgtest is True. It refers to a debian:suite collection which can be used to schedule autopkgtest tasks for reverse dependencies and/or to generate the reference QA results that are needed to detect regressions.
When enable_regression_tracking is set:

- the normal qa workflow is run with:
  - enable_regression_tracking set to True
  - reference_qa_results set with the value of regression_tracking_qa_results
- an extra qa workflow is run with:
  - enable_regression_tracking set to False
  - reference_qa_results set with the value of regression_tracking_qa_results
  - update_qa_results set to True
  - prefix set to reference-qa-result|
  - source_artifact and binary_artifacts pointing to the corresponding artifacts in the debian:suite collection listed by qa_suite
  - fail_on set to never
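As a sketch, the parameters passed to the two qa sub-workflows would thus look roughly as follows (Python dictionaries with placeholder values in angle brackets; this is illustrative, not an exact schema):

# Normal qa workflow: analyzes the update and tracks regressions.
normal_qa = {
    "enable_regression_tracking": True,
    "reference_qa_results": "<regression_tracking_qa_results>",
}

# Extra qa workflow: refreshes the reference QA results.
reference_qa = {
    "enable_regression_tracking": False,
    "reference_qa_results": "<regression_tracking_qa_results>",
    "update_qa_results": True,
    "prefix": "reference-qa-result|",
    "source_artifact": "<corresponding source artifact in qa_suite>",
    "binary_artifacts": ["<corresponding binary artifacts in qa_suite>"],
    "fail_on": "never",
}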
Handling of qa_failure_policy
The current behaviour is ignore
. The workflow is scheduled but nothing
depends on it.
With the fail
policy, the remaining workflows (or at least the next
workflow in the dependency chain) gain dependencies on the qa
workflow
so that they are not executed before completion of the qa
workflow
and are effectively aborted if anything fails.
With the confirm
policy, the qa
workflow is scheduled with
workflow_data.allow_failure: true
and a new Confirm WAIT task is scheduled in between the qa
workflow and the remaining
parts of the workflow. That confirm task has the following task data:
auto_confirm_if_no_failure: true
confirm_label: "Continue the workflow despite QA failures"
Changes to the qa workflow
The workflow is modified to also accept multiple
debian:binary-package artifacts as input in
binary_artifacts
. This will require ensuring that the sub-workflows are
able to accept a similar input.
The workflow is expanded with new parameters:

- enable_regression_tracking (defaults to False): configure the QA workflow to detect and display regressions in QA results.
- reference_qa_results (Single lookup, optional): the collection of category debian:qa-results that contains the reference results of QA tasks to use to detect regressions.
- update_qa_results (boolean, defaults to False): when set to True, the workflow runs QA tasks and updates the collection passed in reference_qa_results with the results.
- prefix (string, optional): prefix this string to the item names provided in the internal collection.
- fail_on (string, optional): indicate the conditions that trigger a failure of the whole workflow. Allowed values are failure, regression, never. With failure, the workflow is marked as failed if one of the QA tasks fails. With regression, the workflow fails only if one of the QA results is a regression compared to the former result. With never, the workflow always succeeds. The default value is regression if enable_regression_tracking is True, otherwise it is failure.
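The resulting default for fail_on can be summarized with this trivial sketch (task_data is a stand-in for the workflow's input data, not an exact API):

fail_on = task_data.get("fail_on")
if fail_on is None:
    # The default depends on whether regression tracking is enabled.
    fail_on = "regression" if task_data.get("enable_regression_tracking") else "failure"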
Behavior with update_qa_results set to True
When update_qa_results
is set to True, the goal of the workflow
is modified: its only purpose is to provide reference results to
be stored in a debian:qa-results collection. Task failures are
never fatal for the parent workflow or for dependent tasks.
During orchestration, the workflow compares the data available in the debian:qa-results collection with information about the submitted source_artifact and binary_artifacts.
When a missing or outdated QA result is detected, it schedules the appropriate QA task, and it creates a corresponding promise in the internal collection (the name of the promise is the prefix followed by the expected name of the collection entry). The QA task has the following event reactions:

- on_assignment: an action to skip the work request if the latest relevant item in the debian:qa-results collection has changed since the work request was created; this avoids wasting resources if multiple parallel workflows trigger an update of the same QA results
- on_success: an action to add the result to the debian:qa-results collection
- on_failure: same as on_success
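For illustration only, those event reactions might be expressed along these lines; the action names and fields below are placeholders and not a confirmed schema:

# Placeholder action names and fields, for illustration only.
event_reactions = {
    "on_assignment": [{
        # Skip if the latest relevant item changed since the work request
        # was created (avoids duplicate work across parallel workflows).
        "action": "skip-if-lookup-result-changed",  # hypothetical name
        "lookup": "<qa-results>/latest:lintian_dpkg_amd64",
    }],
    "on_success": [{
        # Add the QA result to the debian:qa-results collection.
        "action": "update-collection-with-qa-result",  # hypothetical name
        "collection": "<reference_qa_results>",
    }],
    "on_failure": [{
        # Same as on_success: failed results are recorded too.
        "action": "update-collection-with-qa-result",  # hypothetical name
        "collection": "<reference_qa_results>",
    }],
}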
Note that when enable_reverse_dependencies_autopkgtest
is set to True,
it must also update the autopkgtest results of the reverse dependencies
and thus compute the same list of packages as the
reverse_dependencies_autopkgtest
workflow (using the same
qa_suite
collection).
Behavior with enable_regression_tracking set to True
When enable_regression_tracking
is set to True, the orchestrator
of the qa
workflow schedules workflow callbacks that will perform the regression analysis. In order
to wait for the availability of the QA result(s), those callbacks have
dependencies against:
- the promises associated with the QA result(s) that are required from the additional qa workflow building the reference results
- the promises associated with the QA result(s) that are required from the sub-workflows
The workflow_data field for those workflow callbacks has:

- visible set to False so that they do not show up in the workflow hierarchy (new feature to implement)
- step set to regression-analysis
As part of the callback, the analysis is performed and the result
of the analysis is stored in the output_data
field of the workflow.
Note
We use simple workflow callbacks instead of full-fledged worker tasks or server tasks because we assume that the regression analysis can be completed just by comparing the artifact metadata and/or the collection item. Workflow callbacks are already handled through Celery tasks, so they are relatively cheap. Note however that the large number of callbacks requires careful locking to serialize the operations between concurrent runs trying to update the same workflow.
Handling of fail_on
With fail_on: never
or fail_on: regression
, all the sub-workflows
are run with workflow_data.allow_failure: true
.
With fail_on: regression
, a final orchestrator callback is scheduled:
- it depends on all the regression-analysis callbacks
- workflow_data.visible is set to True
- workflow_data.step is final-regression-analysis
- workflow_data.display_name is Regression analysis
The callback reviews the data in output_data.regression_analysis
and sets its own result to FAILURE in case of regression, or SUCCESS
otherwise.
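A minimal sketch of that review, assuming regression_analysis is the dictionary documented below under "Implementation of QA results regression analysis" and representing the work request results as plain strings:

def final_regression_analysis_result(regression_analysis):
    # FAILURE if any individual analysis concluded to a regression,
    # SUCCESS otherwise (shown here as plain strings).
    statuses = {entry["status"] for entry in regression_analysis.values()}
    return "failure" if "regression" in statuses else "success"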
Note
The assumption is that the result of this callback will bubble
up to the qa
workflow since the callback itself should
be configured with workflow_data.allow_failure: false
.
If that assumption does not hold, then we might want to simply set the result of the workflow ourselves: since we know (through dependencies) that we are the last work request in the workflow, we should be able to just mark it as completed too!
Various removals
The debian:suite-lintian collection is removed since the debian:qa-results collection is effectively a superset of that collection.
Therefore the UpdateDerivedCollection and UpdateSuiteLintianCollection server tasks are removed too.
Implementation of QA results regression analysis
The output_data of a qa workflow has a new regression_analysis key which is a dictionary of such analyses. The key represents the name of a test (e.g. autopkgtest:dpkg:amd64) without any version, and the value is the result of the analysis, which is defined as another dictionary with the following keys:
- original_url (optional, can be set later when the QA result is available): URL pointing to the original artifact or bare-data collection item used for the comparison
- new_url (optional, can be set later when the QA result is available): URL pointing to the new artifact or bare-data collection item used for the comparison
- status (required): a string value among the following values:
  - no-result: when the comparison has not been completed yet (usually because we lack one of the two required QA results)
  - error: when the comparison (or one of the required QA tasks) errored out
  - improvement: when the new QA result is better than the original QA result
  - stable: when the new QA result is neither better nor worse than the original QA result
  - regression: when the new QA result is worse than the original QA result
- details (optional): an arbitrarily nested data structure composed of lists and dictionaries where dictionary keys and leaf items (and/or leaf item values) are always strings. The expectation is that this structure is rendered as nested lists shown behind a collapsible section that can be unfolded to learn more about the analysis. The strings are HTML-escaped when rendered.
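An illustrative example of that structure (the URLs, package names and details below are made up):

regression_analysis = {
    "autopkgtest:dpkg:amd64": {
        "original_url": "https://debusine.example.org/artifact/123/",
        "new_url": "https://debusine.example.org/artifact/456/",
        "status": "regression",
        "details": {"new failing tests": ["run-tests"]},
    },
    "lintian:dpkg:source": {
        # Comparison not completed yet: one QA result is still missing.
        "status": "no-result",
    },
}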
The regression analysis can lead to multiple results:

- no-result: when we are lacking one of the QA results for the comparison
- improvement: when the new QA result is "success" while the reference one is "failure"
- stable: when both QA results are "success" (no regression)
- regression: when the new QA result is "failure" while the reference one is "success"
- error: when one of the work requests providing the required QA results errored out

Details can also be provided as output of the analysis; they will typically be displayed in the summary view of the qa workflow.
The first level of comparison is at the level of the result
of the WorkRequest
, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections detail how those deeper comparisons are performed.
For lintian
We compare the summary.tags_count_by_severity
to determine the
status of the regression analysis:
SEVERITIES = ("warning", "error")
if any(new_count[s] > original_count[s] for s in SEVERITIES):
    return "regression"
elif any(new_count[s] < original_count[s] for s in SEVERITIES):
    return "improvement"
else:
    return "stable"
We also perform a comparison of the summary.tags_found
to indicate
in the details
field which new tags have been reported, and which tags
have disappeared.
Note
Among the differing tags, there can be tags with severities lower than warning and error, but we have no way to filter them out without loading the full analysis.json from the artifact, which would be much more costly for almost no gain.
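A sketch of that comparison, based on set differences of the summary.tags_found lists (the details keys are illustrative):

def lintian_tags_details(original_tags, new_tags):
    # Both arguments come from summary.tags_found of the two analyses.
    original_tags, new_tags = set(original_tags), set(new_tags)
    return {
        "new tags": sorted(new_tags - original_tags),
        "disappeared tags": sorted(original_tags - new_tags),
    }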
For autopkgtest
We compare the result of each individual test in the results
key
of the artifact metadata. Each result is classified on its own following
the table below; the first line that matches ends the classification process:
| ORIGINAL | NEW | RESULT |
|---|---|---|
| (any) | FLAKY | stable |
| PASS, SKIP | FAIL | regression |
| FAIL, FLAKY | PASS, SKIP | improvement |
| (any) | (any) | stable |
Each individual regression or improvement is noted and documented in the
details
field of the analysis.
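A sketch of that per-test classification, implementing the table above with first-match semantics:

def classify_test_result(original, new):
    # The first matching rule wins, as in the table above.
    if new == "FLAKY":
        return "stable"
    if original in ("PASS", "SKIP") and new == "FAIL":
        return "regression"
    if original in ("FAIL", "FLAKY") and new in ("PASS", "SKIP"):
        return "improvement"
    return "stable"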
To compute the global result of the regression analysis, the logic is the following:
if "regression" in comparison_of_tests:
    return "regression"
elif "improvement" in comparison_of_tests:
    return "improvement"
else:
    return "stable"
For piuparts and blhc
The provided metadata do not allow for deep comparisons, so the comparison
is based on the result
of the corresponding WorkRequest (which is
duplicated in the per-item data of the debian:qa-results
collection).
The algorithm is the following:
if origin.result == SUCCESS and new.result == FAILURE:
    return "regression"
elif origin.result == FAILURE and new.result == SUCCESS:
    return "improvement"
else:
    return "stable"
About the UI to display regression analysis
Here’s an example of what the table could look like:
| Test name | Original result for dpkg_1.2.0 | New result for dpkg_1.2.1 | Conclusion |
|---|---|---|---|
| autopkgtest:dpkg_amd64 | ✅ | ❌ | ↘️ regression |
| lintian:dpkg_source | ✅ | ✅ | ➡️ stable |
| piuparts:dpkg_amd64 | ✅ | ❔ | ❔ no-result |
| autopkgtest:apt_amd64 | ❌ | ✅ | ↗️ improvement |
| Summary | 1 failure | 1 failure, 1 missing result | ↘️ regression |
Multiple comments about the desired table:

- We should use the standard WorkRequest result widgets instead of the special characters (✅ and ❌) shown above.
- We want to put links to the artifact for each QA result in the "Original result" and "New result" columns.
- The number of autopkgtest results due to the reverse_dependencies_autopkgtest workflow can be overwhelming. Due to this, the autopkgtest lines that concern source packages other than the one processed in the current workflow are hidden if the regression analysis result is "stable" or "no-result".
- For piuparts tasks where we don't have artifacts to link to, we probably want to link to the work request directly.