Detecting regressions
Goals and requirements
When preparing updates, it is a best practice to ensure that the update does not introduce any regression in any of the QA checks. The concept of a regression is thus managed at the level of the QA workflow.
Some regressions can be identified from the result of the respective QA tasks. An autopkgtest that used to pass and now fails is an obvious example. However, when both runs fail, you might want to dig deeper and check whether the same set of tests is failing: if new tests are failing, we still have a regression. Conversely, a lintian analysis might succeed on both versions but generate new warning-level tags, which we might want to report as a regression.
Other considerations:

- We want the QA workflow to have a summary view showing side-by-side the results of the QA tasks on the original and updated package(s), along with a conclusion on whether there is a regression or not. It should be possible to have a detailed view of a specific comparison, showing a comparison of the artifacts generated by the QA tasks.
- We want that table to be regularly updated, every time a QA task finishes, without waiting for the completion of all QA tasks.
- We want to be able to configure the debian_pipeline workflow so that any failure or any regression in the qa workflow requires a manual confirmation to continue the root workflow. (Right now the result of the qa workflow has no impact on the package_upload workflow, for example.)
About the debian:qa-results collection category
This collection can host all the QA results for all the packages in a given suite. We thus have a different kind of item for each QA task.
Common features across all QA results
The name of the collection item is always {task_name}_{package}_{version}_{architecture}_{work_request_id}, substituting values from the per-item data described below.

Variables when adding items: see "Per-item data" below.
Data:

- suite_collection (Single lookup, required): the debian:suite collection for which the collection is storing the QA results.
- old_items_to_keep: number of old items to keep. Defaults to 5. For each task/package/architecture combination, the collection always keeps the given number of most recent entries (based on the timestamp field). The cleanup is automatically done when adding new items.

Note

At some point, we may need more advanced logic than this, for instance to clean up QA results about packages that are gone from the corresponding suite.
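A minimal sketch of the old_items_to_keep retention logic described above, treating collection items as plain dictionaries carrying the per-item data; the helper name is illustrative, not an existing API:

from collections import defaultdict

def items_to_clean_up(items, old_items_to_keep=5):
    # Group items by task/package/architecture combination.
    groups = defaultdict(list)
    for item in items:
        key = (item["task_name"], item["package"], item["architecture"])
        groups[key].append(item)
    # Keep the most recent entries (based on the timestamp field) and
    # return the rest as candidates for removal.
    removable = []
    for group in groups.values():
        group.sort(key=lambda i: i["timestamp"], reverse=True)
        removable.extend(group[old_items_to_keep:])
    return removable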
Per-item data:

- task_name (inferred from work request when adding item): the name of the task that generated the QA result (so autopkgtest, lintian, piuparts, in the future also blhc)
- package: the name of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- version: the version of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- architecture: source for a source analysis, or the appropriate architecture name for a binary analysis
- timestamp: a Unix timestamp (cf. date +%s) used to estimate the freshness of the QA result; it can often usefully be tied to the Date field of the package repository in which the analysis was performed
- work_request_id: the ID of the WorkRequest that generated the QA result
- result (inferred from work request when adding item): duplicates the string value of the result field of the associated WorkRequest
Lookup names:

- latest:TASKNAME_PACKAGE_ARCHITECTURE: the latest QA result for the task named TASKNAME for the source package named PACKAGE on ARCHITECTURE.
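For illustration, a sketch of how an item name and its lookup name derive from the per-item data (the helper and the example values below are made up):

def qa_result_names(data):
    # data is a dictionary carrying the per-item fields described above.
    item = "{task_name}_{package}_{version}_{architecture}_{work_request_id}".format(**data)
    lookup = "latest:{task_name}_{package}_{architecture}".format(**data)
    return item, lookup

# Example: a lintian analysis of a hypothetical dpkg upload on amd64
qa_result_names({
    "task_name": "lintian",
    "package": "dpkg",
    "version": "1.22.6-1",
    "architecture": "amd64",
    "work_request_id": 1234,
})
# -> ("lintian_dpkg_1.22.6-1_amd64_1234", "latest:lintian_dpkg_amd64")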
Specification for lintian
The collection item is a debian:lintian artifact. The collection contains items for the source and for all architectures.
A lintian analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what's available in the debian:suite collection
- or the lintian version used to perform the analysis is older than the version available in the debian:suite collection
Specification for autopkgtest
The collection item is a debian:autopkgtest artifact. The collection contains items for all architectures (but not for the source).
An autopkgtest analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what's available in the debian:suite collection
- or the timestamp of the analysis is older than 30 days compared to the Date timestamp of the debian:suite collection
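For the timestamp-based criterion, a minimal sketch (assuming the suite's Date field is available as a Unix timestamp):

THIRTY_DAYS = 30 * 24 * 3600

def autopkgtest_timestamp_outdated(analysis_timestamp, suite_date_timestamp):
    # Both arguments are Unix timestamps (cf. date +%s).
    return suite_date_timestamp - analysis_timestamp > THIRTY_DAYS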
Specification for piuparts
The collection item is a bare data item of category debian:qa-result
with all the common per-item data described above. The collection contains
items for all architectures (but not for the source).
A piuparts analysis is outdated if the underlying binary packages are outdated (i.e. have different version numbers) compared to what’s available in the debian:suite collection.
Note
The lack of a piuparts artifact means that we don't have any information about the binary packages that were analyzed, except by looking up the details of the WorkRequest. That's probably going too far, so instead we will likely compare based on the source version documented in the per-item data.
Filed #805 to think about introducing a proper artifact at some point.
Specification for blhc
The collection item is a debian:blhc artifact. The collection contains items for all architectures (but not for the source).
A blhc analysis is outdated if the underlying source package is outdated (i.e. has a smaller version number) compared to what’s available in the debian:suite collection. The comparison needs to be performed based on the metadata of the linked debian:package-build-log artifact.
Specification for debdiff
The DebDiff QA task does not contribute any item to the
debian:qa-results
because it does not provide any validation
of a single target artifact.
By its nature, the task already performs a comparison between the original version and the new version, and the result of that comparison cannot easily be used to draw any conclusion about the update; it is up to human reviewers to decide.
Implementation plan
Changes to the debian_pipeline workflow
The workflow is expanded with new parameters:

- enable_regression_tracking (boolean, defaults to False): configure the QA workflow to detect and display regressions in QA results.
- regression_tracking_qa_results (Single lookup, required if enable_regression_tracking is True): the debian:qa-results collection that contains the reference results of QA tasks to use to detect regressions.
- qa_failure_policy (string, default value ignore): the policy to apply when the qa workflow failed. Allowed values are ignore, fail, confirm.
The parameter reverse_dependencies_autopkgtest_suite is renamed to qa_suite and becomes required if either enable_regression_tracking or enable_reverse_dependencies_autopkgtest is True. It refers to a debian:suite collection which can be used to schedule autopkgtest tasks for reverse dependencies and/or to generate the reference QA results that are needed to detect regressions.
When enable_regression_tracking is set:

- the normal qa workflow is run with:
  - enable_regression_tracking set to True
  - reference_qa_results set with the value of regression_tracking_qa_results
- an extra qa workflow is run with:
  - enable_regression_tracking set to False
  - reference_qa_results set with the value of regression_tracking_qa_results
  - update_qa_results set to True
  - prefix set to reference-qa-result|
  - source_artifact and binary_artifacts pointing to the corresponding artifacts in the debian:suite collection listed by qa_suite
  - fail_on set to never
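As a sketch, the parameters passed to the two qa sub-workflows would thus look roughly as follows (Python dictionaries with placeholder values in angle brackets; this is illustrative, not an exact schema):

# Normal qa workflow: analyzes the update and tracks regressions.
normal_qa = {
    "enable_regression_tracking": True,
    "reference_qa_results": "<regression_tracking_qa_results>",
}

# Extra qa workflow: refreshes the reference QA results.
reference_qa = {
    "enable_regression_tracking": False,
    "reference_qa_results": "<regression_tracking_qa_results>",
    "update_qa_results": True,
    "prefix": "reference-qa-result|",
    "source_artifact": "<corresponding source artifact in qa_suite>",
    "binary_artifacts": ["<corresponding binary artifacts in qa_suite>"],
    "fail_on": "never",
}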
Handling of qa_failure_policy
The current behaviour is ignore
. The workflow is scheduled but nothing
depends on it.
With the fail
policy, the remaining workflows (or at least the next
workflow in the dependency chain) gain dependencies on the qa
workflow
so that they are not executed before completion of the qa
workflow
and are effectively aborted if anything fails.
With the confirm
policy, the qa
workflow is scheduled with
workflow_data.allow_failure: true
and a new Confirm WAIT task is scheduled in between the qa
workflow and the remaining
parts of the workflow. That confirm task has the following task data:
auto_confirm_if_no_failure: true
confirm_label: "Continue the workflow despite QA failures"
Changes to the qa workflow
The workflow is modified to also accept multiple
debian:binary-package artifacts as input in
binary_artifacts
. This will require ensuring that the sub-workflows are
able to accept a similar input.
The workflow is expanded with new parameters:

- enable_regression_tracking (defaults to False): configure the QA workflow to detect and display regressions in QA results.
- reference_qa_results (Single lookup, optional): the collection of category debian:qa-results that contains the reference results of QA tasks to use to detect regressions.
- update_qa_results (boolean, defaults to False): when set to True, the workflow runs QA tasks and updates the collection passed in reference_qa_results with the results.
- prefix (string, optional): prefix this string to the item names provided in the internal collection.
- fail_on (string, optional): indicate the conditions that trigger a failure of the whole workflow. Allowed values are failure, regression, never. With failure, the workflow is marked as failed if one of the QA tasks fails. With regression, the workflow fails only if one of the QA results is a regression compared to the former result. With never, the workflow always succeeds. The default value is regression if enable_regression_tracking is True, otherwise it is failure.
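The resulting default for fail_on can be summarized with this trivial sketch (task_data is a stand-in for the workflow's input data, not an exact API):

fail_on = task_data.get("fail_on")
if fail_on is None:
    # The default depends on whether regression tracking is enabled.
    fail_on = "regression" if task_data.get("enable_regression_tracking") else "failure"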
Behavior with update_qa_results set to True
When update_qa_results
is set to True, the goal of the workflow
is modified: its only purpose is to provide reference results to
be stored in a debian:qa-results collection. Task failures are
never fatal for the parent workflow or for dependent tasks.
During orchestration, the workflow compares the data available in the debian:qa-results collection with information about the submitted source_artifact and binary_artifacts.
When a missing or outdated QA result is detected, it schedules the appropriate QA task, and it creates a corresponding promise in the internal collection (the name of the promise is the prefix followed by the expected name of the collection entry). The QA task has the following event reactions:

- on_assignment: an action to skip the work request if the latest relevant item in the debian:qa-results collection has changed since the work request was created; this avoids wasting resources if multiple parallel workflows trigger an update of the same QA results
- on_success: an action to add the result to the debian:qa-results collection
- on_failure: same as on_success
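For illustration only, those event reactions might be expressed along these lines; the action names and fields below are placeholders and not a confirmed schema:

# Placeholder action names and fields, for illustration only.
event_reactions = {
    "on_assignment": [{
        # Skip if the latest relevant item changed since the work request
        # was created (avoids duplicate work across parallel workflows).
        "action": "skip-if-lookup-result-changed",  # hypothetical name
        "lookup": "<qa-results>/latest:lintian_dpkg_amd64",
    }],
    "on_success": [{
        # Add the QA result to the debian:qa-results collection.
        "action": "update-collection-with-qa-result",  # hypothetical name
        "collection": "<reference_qa_results>",
    }],
    "on_failure": [{
        # Same as on_success: failed results are recorded too.
        "action": "update-collection-with-qa-result",  # hypothetical name
        "collection": "<reference_qa_results>",
    }],
}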
Note that when enable_reverse_dependencies_autopkgtest
is set to True,
it must also update the autopkgtest results of the reverse dependencies
and thus compute the same list of packages as the
reverse_dependencies_autopkgtest
workflow (using the same
qa_suite
collection).
Behavior with enable_regression_tracking set to True
When enable_regression_tracking
is set to True, the orchestrator
of the qa
workflow schedules workflow callbacks that will perform the regression analysis. In order
to wait for the availability of the QA result(s), those callbacks have
dependencies against:
- the promises associated with the QA result(s) that are required from the additional qa workflow building the reference results
- the promises associated with the QA result(s) that are required from the sub-workflows
The workflow_data field for those workflow callbacks has:

- visible set to False so that they do not show up in the workflow hierarchy (new feature to implement)
- step set to regression-analysis
As part of the callback, the analysis is performed and the result
of the analysis is stored in the output_data
field of the workflow.
Note
We use simple workflow callbacks instead of full-fledged worker tasks or server tasks because we assume that the regression analysis can be completed just by comparing the artifact metadata and/or the collection item. Workflow callbacks are already handled through Celery tasks, so they are relatively cheap. Note however that the large number of callbacks requires careful locking to serialize the operations between concurrent runs trying to update the same workflow.
Handling of fail_on
With fail_on: never
or fail_on: regression
, all the sub-workflows
are run with workflow_data.allow_failure: true
.
With fail_on: regression
, a final orchestrator callback is scheduled:
- it depends on all the regression-analysis callbacks
- workflow_data.visible is set to True
- workflow_data.step is final-regression-analysis
- workflow_data.display_name is Regression analysis
The callback reviews the data in output_data.regression_analysis
and sets its own result to FAILURE in case of regression, or SUCCESS
otherwise.
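A minimal sketch of that review, assuming regression_analysis is the dictionary documented below under "Implementation of QA results regression analysis" and representing the work request results as plain strings:

def final_regression_analysis_result(regression_analysis):
    # FAILURE if any individual analysis concluded to a regression,
    # SUCCESS otherwise (shown here as plain strings).
    statuses = {entry["status"] for entry in regression_analysis.values()}
    return "failure" if "regression" in statuses else "success"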
Note
The assumption is that the result of this callback will bubble
up to the qa
workflow since the callback itself should
be configured with workflow_data.allow_failure: false
.
If that assumption does not hold, then we might want to simply set the result of the workflow ourselves: since we know (through dependencies) that we are the last work request in the workflow, we should be able to just mark it as completed too!
Various removals
The debian:suite-lintian collection is removed since the debian:qa-results collection is effectively a superset of that collection.
Therefore the UpdateDerivedCollection and UpdateSuiteLintianCollection server tasks are removed too.
Implementation of QA results regression analysis
The output_data of a qa workflow has a new regression_analysis key which is a dictionary of such analyses. The key represents the name of a test (e.g. autopkgtest:dpkg:amd64) without any version, and the value is the result of the analysis, which is defined as another dictionary with the following keys:
- original_url (optional, can be set later when the QA result is available): URL pointing to the original artifact or bare-data collection item used for the comparison
- new_url (optional, can be set later when the QA result is available): URL pointing to the new artifact or bare-data collection item used for the comparison
- status (required): a string value among the following values:
  - no-result: when the comparison has not been completed yet (usually because we lack one of the two required QA results)
  - error: when the comparison (or one of the required QA tasks) errored out
  - improvement: when the new QA result is better than the original QA result
  - stable: when the new QA result is neither better nor worse than the original QA result
  - regression: when the new QA result is worse than the original QA result
- details (optional): an arbitrarily nested data structure composed of lists and dictionaries where dictionary keys and leaf items (and/or leaf item values) are always strings. The expectation is that this structure is rendered as nested lists shown behind a collapsible section that can be unfolded to learn more about the analysis. The strings are HTML-escaped when rendered.
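An illustrative example of that structure (the URLs, package names and details below are made up):

regression_analysis = {
    "autopkgtest:dpkg:amd64": {
        "original_url": "https://debusine.example.org/artifact/123/",
        "new_url": "https://debusine.example.org/artifact/456/",
        "status": "regression",
        "details": {"new failing tests": ["run-tests"]},
    },
    "lintian:dpkg:source": {
        # Comparison not completed yet: one QA result is still missing.
        "status": "no-result",
    },
}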
The regression analysis can lead to multiple results:

- no-result: when we are lacking one of the QA results for the comparison
- improvement: when the new QA result is "success" while the reference one is "failure"
- stable: when both QA results are "success" (no regression)
- regression: when the new QA result is "failure" while the reference one is "success"
- error: when one of the work requests providing the required QA results errored out

Details can also be provided as output of the analysis; they will typically be displayed in the summary view of the qa workflow.
The first level of comparison is at the level of the result
of the WorkRequest
, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections detail how those deeper comparisons are performed.
For lintian
We compare the summary.tags_count_by_severity
to determine the
status of the regression analysis:
SEVERITIES = ("warning", "error")
if any(new_count[s] > original_count[s] for s in SEVERITIES):
    return "regression"
elif any(new_count[s] < original_count[s] for s in SEVERITIES):
    return "improvement"
else:
    return "stable"
We also perform a comparison of the summary.tags_found
to indicate
in the details
field which new tags have been reported, and which tags
have disappeared.
Note
Among the differing tags, there can be tags with severities lower than warning and error, but we have no way to filter them out without loading the full analysis.json from the artifact, which would be much more costly for almost no gain.
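A sketch of that comparison, based on set differences of the summary.tags_found lists (the details keys are illustrative):

def lintian_tags_details(original_tags, new_tags):
    # Both arguments come from summary.tags_found of the two analyses.
    original_tags, new_tags = set(original_tags), set(new_tags)
    return {
        "new tags": sorted(new_tags - original_tags),
        "disappeared tags": sorted(original_tags - new_tags),
    }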
For autopkgtest
We compare the result of each individual test in the results
key
of the artifact metadata. Each result is classified on its own following
the table below; the first line that matches ends the classification process:
| ORIGINAL | NEW | RESULT |
|---|---|---|
| (any) | FLAKY | stable |
| PASS, SKIP | FAIL | regression |
| FAIL, FLAKY | PASS, SKIP | improvement |
| (any) | (any) | stable |
Each individual regression or improvement is noted and documented in the
details
field of the analysis.
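A sketch of that per-test classification, implementing the table above with first-match semantics:

def classify_test_result(original, new):
    # The first matching rule wins, as in the table above.
    if new == "FLAKY":
        return "stable"
    if original in ("PASS", "SKIP") and new == "FAIL":
        return "regression"
    if original in ("FAIL", "FLAKY") and new in ("PASS", "SKIP"):
        return "improvement"
    return "stable"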
To compute the global result of the regression analysis, the logic is the following:
if "regression" in comparison_of_tests:
    return "regression"
elif "improvement" in comparison_of_tests:
    return "improvement"
else:
    return "stable"
For piuparts and blhc
The provided metadata do not allow for deep comparisons, so the comparison
is based on the result
of the corresponding WorkRequest (which is
duplicated in the per-item data of the debian:qa-results
collection).
The algorithm is the following:
if origin.result == SUCCESS and new.result == FAILURE:
    return "regression"
elif origin.result == FAILURE and new.result == SUCCESS:
    return "improvement"
else:
    return "stable"
About the UI to display regression analysis
Here’s an example of what the table could look like:
| Test name | Original result for dpkg_1.2.0 | New result for dpkg_1.2.1 | Conclusion |
|---|---|---|---|
| autopkgtest:dpkg_amd64 | ✅ | ❌ | ↘️ regression |
| lintian:dpkg_source | ✅ | ✅ | ➡️ stable |
| piuparts:dpkg_amd64 | ✅ | ❔ | ❔ no-result |
| autopkgtest:apt_amd64 | ❌ | ✅ | ↗️ improvement |
| Summary | 1 failure | 1 failure, 1 missing result | ↘️ regression |
Multiple comments about the desired table:

- We should use the standard WorkRequest result widgets instead of the special characters (✅ and ❌) shown above.
- We want to put links to the artifact for each QA result in the "Original result" and "New result" columns.
- The number of autopkgtest results due to the reverse_dependencies_autopkgtest workflow can be overwhelming. Due to this, the autopkgtest lines that concern source packages other than the one processed in the current workflow are hidden if the regression analysis result is "stable" or "no-result".
- For piuparts tasks where we don't have artifacts to link to, we probably want to link to the work request directly.