Detecting regressions
Goals and requirements
When preparing updates, it is a best practice to ensure that the update does not introduce any regressions in the QA checks. The concept of a regression is thus to be managed at the level of the QA workflow.
Some regressions can be identified from the result of the respective QA tasks. An autopkgtest that used to pass and now fails is an obvious example. However, when both versions are failing, you might want to dig deeper and check whether the same set of tests is failing: if new tests are failing, we still have a regression. Conversely, a lintian analysis might succeed on both versions but generate new warning-level tags, which we might want to report as a regression.
Other considerations:
- We want the QA workflow to have a summary view showing side by side the results of the QA tasks on the original and updated package(s), along with a conclusion on whether there is a regression or not. It should be possible to get a detailed view of a specific comparison, showing a comparison of the artifacts generated by the QA tasks.
- We want that table to be updated regularly, every time a QA task finishes, without waiting for the completion of all QA tasks.
- We want to be able to configure the debian_pipeline workflow so that any failure or any regression in the qa workflow requires a manual confirmation to continue the root workflow. (Right now the result of the qa workflow has no impact on the package_upload workflow, for example.)
About the debian:qa-results collection category
This collection can host all the QA results for all the packages in a given suite. It thus contains different kinds of items, one per QA task.
Common features across all QA results
The name of the collection item is always {task_name}:{package}:{version}:{architecture}:{work_request_id}, substituting values from the per-item data described below.
Data:

- suite_collection (Single lookup, required): the debian:suite collection for which this collection is storing the QA results.
- old_items_to_keep: number of old items to keep. Defaults to 5. For each task/package/architecture combination, the collection always keeps the given number of most recent entries (based on the timestamp field). The cleanup is automatically done when adding new items.

Note
At some point, we may need more advanced logic than this, for instance to clean up QA results about packages that are gone from the corresponding suite.
Per-item data:

- task_name: the name of the task that generated the QA result (so autopkgtest, lintian, piuparts, and in the future also blhc)
- package: the name of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- version: the version of the source package being analyzed, or of the source package from which the binary package being analyzed was built
- architecture: source for a source analysis, or the appropriate architecture name for a binary analysis
- timestamp: a Unix timestamp (cf. date +%s) used to estimate the freshness of the QA result; it can often be usefully tied to the Date field of the package repository in which the analysis was performed
- work_request_id: the ID of the WorkRequest that generated the QA result
- result: duplicates the string value of the result field of the associated WorkRequest
Lookup names:

- latest:TASK_NAME:PACKAGE:ARCHITECTURE: the latest QA result of task TASK_NAME for the source package named PACKAGE on ARCHITECTURE.
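For illustration, here is a minimal sketch of how an item name and the corresponding "latest" lookup name are built from the per-item data; the concrete values are made up:

# Illustrative sketch only: deriving the collection item name and the
# "latest" lookup name from the per-item data described above.
item = {
    "task_name": "autopkgtest",
    "package": "dpkg",
    "version": "1.22.6",
    "architecture": "amd64",
    "work_request_id": 1234,
}
item_name = "{task_name}:{package}:{version}:{architecture}:{work_request_id}".format(**item)
latest_lookup = "latest:{task_name}:{package}:{architecture}".format(**item)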
Specification for lintian
The collection item is a debian:lintian
artifact (see
Category debian:lintian). The collection contains items for the source and
for all architectures.
A lintian analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what’s available in the debian:suite collection,
- or the lintian version used to perform the analysis is older than the version available in the debian:suite collection.
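As a rough sketch only (the plain dictionaries and field names are assumptions made for illustration, using python-debian's version_compare for Debian version ordering), the rules above could translate to:

from debian.debian_support import version_compare

def lintian_result_is_outdated(analysis, suite):
    # Any analyzed package whose version differs from what the suite
    # currently ships makes the stored analysis stale.
    for package, version in analysis["package_versions"].items():
        if suite["package_versions"].get(package) != version:
            return True
    # So does a lintian older than the one available in the suite.
    return version_compare(analysis["lintian_version"], suite["lintian_version"]) < 0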
Specification for autopkgtest
The collection item is a debian:autopkgtest
artifact (see
Category debian:autopkgtest). The collection contains items for all
architectures (but not for the source).
An autopkgtest analysis is outdated if:

- either the underlying source or binary packages are outdated (i.e. have different version numbers) compared to what’s available in the debian:suite collection,
- or the timestamp of the analysis is more than 30 days older than the Date timestamp of the debian:suite collection.
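The version check is the same as for lintian; the additional freshness rule could be sketched as follows (assuming both timestamps are Unix epoch seconds):

THIRTY_DAYS = 30 * 24 * 60 * 60

def autopkgtest_timestamp_is_outdated(analysis_timestamp, suite_date_timestamp):
    # Stale if the analysis predates the suite's Date by more than 30 days.
    return analysis_timestamp < suite_date_timestamp - THIRTY_DAYS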
Specification for piuparts
The collection item is a bare data item of category debian:qa-result
with all the common per-item data described above. The collection contains
items for all architectures (but not for the source).
A piuparts analysis is outdated if the underlying binary packages are
outdated (i.e. have smaller version numbers) compared to what’s available
in the debian:suite
collection.
Note
The lack of a piuparts artifact means that we don’t have any information about the binary packages that were analyzed, unless we look up the details of the WorkRequest. That’s probably going too far, so instead we will likely compare based on the source version documented in the per-item data.
Filed https://salsa.debian.org/freexian-team/debusine/-/issues/805 to think about introducing a proper artifact at some point.
Specification for blhc
The collection item is a debian:blhc
artifact (see
Category debian:blhc). The collection contains items for all
architectures (but not for the source).
A blhc analysis is outdated if the underlying source package is
outdated (i.e. has a smaller version number) compared to what’s available
in the debian:suite
collection. The comparison needs to be
performed based on the metadata of the linked
debian:package-build-logs
artifact.
Specification for debdiff
The debdiff QA task does not contribute any item to the debian:qa-results collection because it does not provide any validation of a single target artifact. By its nature, the task already performs a comparison between the original version and the new version, and the result of that comparison can’t easily be used to draw any conclusion about the update: it is up to human reviewers to decide that.
Implementation plan
Changes to the debian_pipeline workflow
The workflow is expanded with new parameters:
- enable_regression_tracking (boolean, defaults to False): configure the QA workflow to detect and display regressions in QA results.
- regression_tracking_qa_results (Single lookup, required if enable_regression_tracking is True): the debian:qa-results collection that contains the reference results of QA tasks to use to detect regressions.
- qa_failure_policy (string, default value ignore): the policy to apply when the qa workflow failed. Allowed values are ignore, fail and confirm.
The parameter reverse_dependencies_autopkgtest_suite is renamed to qa_suite and becomes required if either enable_regression_tracking or enable_reverse_dependencies_autopkgtest is True. It refers to a debian:suite collection which can be used to schedule autopkgtest runs of reverse dependencies and/or to generate the reference QA results that are needed to detect regressions.
When enable_regression_tracking is set (see the parameter sketch after this list):

- the normal qa workflow is run with:
  - enable_regression_tracking set to True
  - reference_qa_results set to the value of regression_tracking_qa_results
- an extra qa workflow is run with:
  - enable_regression_tracking set to False
  - reference_qa_results set to the value of regression_tracking_qa_results
  - update_qa_results set to True
  - prefix set to reference-qa-result|
  - source_artifact and binary_artifacts pointing to the corresponding artifacts in the debian:suite collection listed by qa_suite
  - fail_on set to never
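For illustration, here is a sketch of the parameters passed to those two qa sub-workflows, written as plain Python dictionaries; the lookup values are placeholders, not actual debusine lookup syntax:

# Sketch only: parameters of the two qa sub-workflows described above.
qa_results_collection = "PLACEHOLDER-qa-results-lookup"

normal_qa = {
    "enable_regression_tracking": True,
    "reference_qa_results": qa_results_collection,
}

reference_qa = {
    "enable_regression_tracking": False,
    "reference_qa_results": qa_results_collection,
    "update_qa_results": True,
    "prefix": "reference-qa-result|",
    # Artifacts taken from the debian:suite collection listed by qa_suite.
    "source_artifact": "PLACEHOLDER-source-artifact-lookup",
    "binary_artifacts": ["PLACEHOLDER-binary-artifacts-lookup"],
    "fail_on": "never",
}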
Handling of qa_failure_policy
With the ignore policy (the current behaviour), the qa workflow is scheduled but nothing depends on it.
With the fail
policy, the remaining workflows (or at least the next
workflow in the dependency chain) gain dependencies on the qa
workflow
so that they are not executed before completion of the qa
workflow
and are effectively aborted if anything fails.
With the confirm
policy, the qa
workflow is scheduled with
workflow_data.allow_failure: true
and a new Confirm
WAIT task is scheduled in between the qa
workflow and
the remaining parts of the workflow. That confirm task has the following
task data:
auto_confirm_if_no_failure: true
confirm_label: "Continue the workflow despite QA failures"
Changes to the qa workflow
The workflow is modified to also accept multiple debian:binary-package artifacts as input in binary_artifacts. This will require ensuring that the sub-workflows are able to accept a similar input.
The workflow is expanded with new parameters:
- enable_regression_tracking (defaults to False): configure the QA workflow to detect and display regressions in QA results.
- reference_qa_results (Single lookup, optional): the collection of category debian:qa-results that contains the reference results of QA tasks to use to detect regressions.
- update_qa_results (boolean, defaults to False): when set to True, the workflow runs QA tasks and updates the collection passed in reference_qa_results with the results.
- prefix (string, optional): prefix this string to the item names provided in the internal collection.
- fail_on (string, optional): indicates the conditions that trigger a failure of the whole workflow. Allowed values are failure, regression and never. With failure, the workflow is marked as failed if one of the QA tasks fails. With regression, the workflow fails only if one of the QA results is a regression compared to the former result. With never, the workflow always succeeds. The default value is regression if enable_regression_tracking is True, otherwise it is failure.
Behavior with update_qa_results set to True
When update_qa_results
is set to True, the goal of the workflow
is modified: its only purpose is to provide reference results to
be stored in a debian:qa-results
collection. Task failures are
never fatal for the parent workflow or for dependent tasks.
During orchestration, the workflow compares the data available in the debian:qa-results collection with the information about the submitted source_artifact and binary_artifacts.
When a missing or outdated QA result is detected, it schedules the
appropriate QA task (with an event reaction to update the
debian:qa-results
collection) and it creates a corresponding promise
in the internal collection (the name of the promise is the prefix followed
by the expected name of the collection entry).
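A rough sketch of that orchestration step, over plain data structures (all names here are hypothetical; is_outdated stands in for the per-task staleness rules described earlier, and item names are simplified):

def plan_qa_updates(qa_results, wanted, is_outdated, prefix=""):
    """Return (promise_name, expected) pairs for QA results to (re)compute.

    qa_results maps collection item names to stored QA results, wanted is
    the list of expected results derived from source_artifact and
    binary_artifacts, and is_outdated implements the staleness rules.
    """
    to_schedule = []
    for expected in wanted:
        item_name = "{task_name}:{package}:{version}:{architecture}".format(**expected)
        stored = qa_results.get(item_name)
        if stored is None or is_outdated(stored, expected):
            # Schedule the QA task and create a promise named after the
            # expected collection entry, with the configured prefix.
            to_schedule.append((prefix + item_name, expected))
    return to_schedule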
Note that when enable_reverse_dependencies_autopkgtest
is set to True,
it must also update the autopkgtest results of the reverse dependencies
and thus compute the same list of packages as the
reverse_dependencies_autopkgtest
workflow (using the same
qa_suite
collection).
Note
If multiple parallel workflows trigger the update of the same QA results, we will waste resources computing them multiple times. In order to avoid this, we can imagine some mechanism to detect when such work requests become obsolete.
The suggestion is to create a new on_assign
event processed
in work_request.assign_worker()
where we could hook a new event
reaction skip_if_collection_item_changed
. That event reaction would
be fed:
- a single lookup that is expected to return the QA result stored in the qa-results collection
- the resulting collection item ID that was used as basis at orchestration time
- the name of the associated promise
When the event reaction is processed, it performs the lookup. If the
obtained collection item ID matches the one that was supplied, then the
event reaction does nothing and the worker assignment finishes
normally. On the other hand, if the obtained collection item ID differs
from the one that was supplied, it means that newer data has been
stored since the work request has been scheduled. In this situation, we
assume that we can avoid running the work request and that we can rely
on the newer data. The work request is marked as successfully
completed, but a warning is added in output_data
to explain that
the work request has been skipped due to the newer data. The promise in
the internal collection is replaced by the artifact retrieved from the
collection. And the event reaction finishes by raising some specific
exception that cancels the worker assignment process.
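A minimal sketch of that event reaction, assuming a lookup callable and a simple work-request object; the exception class, the method and the attribute names are illustrative assumptions, not existing debusine APIs:

class SkipWorkRequest(Exception):
    """Raised to cancel the worker assignment process."""

def skip_if_collection_item_changed(work_request, lookup, expected_item_id, promise_name):
    current = lookup()  # QA result currently stored in the qa-results collection
    if current["id"] == expected_item_id:
        # Nothing changed since orchestration: proceed with the assignment.
        return
    # Newer data has been stored: skip the work request and reuse it.
    work_request.result = "success"
    work_request.output_data = {
        "warnings": ["Skipped: newer QA result available in the collection"],
    }
    # Replace the promise in the internal collection with the newer artifact.
    work_request.replace_promise(promise_name, current["artifact"])
    raise SkipWorkRequest()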
Behavior with enable_regression_tracking set to True
When enable_regression_tracking is set to True, the orchestrator of the qa workflow schedules workflow callbacks that will perform the regression analysis. In order to wait for the availability of the QA result(s), those callbacks have dependencies on:

- the promises associated with the QA result(s) that are required from the additional qa workflow building reference results
- the promises associated with the QA result(s) that are required from the sub-workflows
The workflow_data field for those workflow callbacks has:

- visible set to False so that they do not show up in the workflow hierarchy (new feature to implement)
- step set to regression-analysis
As part of the callback, the analysis is performed and the result
of the analysis is stored in the output_data
field of the workflow.
Note
We use simple workflow callbacks instead of full-fledged worker tasks or server tasks because we assume that the regression analysis can be completed just by comparing the artifact metadata and/or the collection item. Workflow callbacks are already handled through celery tasks, so they are relatively cheap. Note however that the large number of callbacks requires careful locking to serialize the operations between concurrent runs trying to update the same workflow.
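To make the wiring concrete, here is a sketch of how one of those callbacks could be scheduled; schedule_callback and the argument names are hypothetical stand-ins for the debusine internals:

def schedule_regression_analysis(schedule_callback, test_name, reference_promise, new_promise):
    return schedule_callback(
        # Hidden from the workflow hierarchy view (new feature to implement).
        workflow_data={"visible": False, "step": "regression-analysis"},
        # Wait for both QA results before running the comparison.
        dependencies=[reference_promise, new_promise],
        kwargs={"test_name": test_name},
    )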
Handling of fail_on
With fail_on: never
or fail_on: regression
, all the sub-workflows
are run with workflow_data.allow_failure: true
.
With fail_on: regression, a final orchestrator callback is scheduled:

- it depends on all the regression-analysis callbacks
- workflow_data.visible is set to True
- workflow_data.step is final-regression-analysis
- workflow_data.display_name is Regression analysis
The callback reviews the data in output_data.regression_analysis
and sets its own result to FAILURE in case of regression, or SUCCESS
otherwise.
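A sketch of what that final callback could do, assuming regression_analysis has the structure described in the implementation section below; the result strings are illustrative:

def final_regression_analysis(workflow_output_data):
    analyses = workflow_output_data.get("regression_analysis", {})
    has_regression = any(
        analysis.get("status") == "regression" for analysis in analyses.values()
    )
    # FAILURE in case of regression, SUCCESS otherwise.
    return "failure" if has_regression else "success"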
Note
The assumption is that the result of this callback will bubble
up to the qa
workflow since the callback itself should
be configured with workflow_data.allow_failure: false
.
If that assumption does not hold true, then we might want to simply set the result of the workflow ourselves: since we know (through dependencies) that we are the last work request in the workflow, we should be able to just mark it as completed too.
Various removals
The debian:suite-lintian
collection is removed since the
debian:qa-results
collection is effectively a superset of that
collection.
Therefore the UpdateDerivedCollection
and
UpdateSuiteLintianCollection
server tasks are removed too.
Implementation of QA results regression analysis
The output_data of a qa workflow has a new regression_analysis key which is a dictionary of such analyses. Each key is the name of a test (e.g. autopkgtest:dpkg:amd64) without any version, and the value is the result of the analysis, which is defined as another dictionary with the following keys:

- original_url (optional, can be set later when the QA result is available): URL pointing to the original artifact or bare-data collection item used for the comparison
- new_url (optional, can be set later when the QA result is available): URL pointing to the new artifact or bare-data collection item used for the comparison
- status (required): a string value among the following:
  - no-result: when the comparison has not been completed yet (usually because we lack one of the two required QA results)
  - error: when the comparison (or one of the required QA tasks) errored out
  - improvement: when the new QA result is better than the original QA result
  - stable: when the new QA result is neither better nor worse than the original QA result
  - regression: when the new QA result is worse than the original QA result
- details (optional): an arbitrarily nested data structure composed of lists and dictionaries where dictionary keys and leaf items (and/or leaf item values) are always strings. The expectation is that this structure is rendered as nested lists shown behind a collapsible section that can be unfolded to learn more about the analysis. The strings are HTML-escaped when rendered.
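As an illustration, a single entry could look like this (URLs and values are made up):

# Made-up example of one regression_analysis entry.
regression_analysis = {
    "autopkgtest:dpkg:amd64": {
        "original_url": "https://debusine.example.org/artifact/1234/",
        "new_url": "https://debusine.example.org/artifact/5678/",
        "status": "regression",
        "details": {
            "newly failing tests": ["run-unit-tests", "integration"],
        },
    },
}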
The regression analysis can lead to multiple results:

- no-result: when we are lacking one of the QA results for the comparison
- improvement: when the new QA result is “success” while the reference one is “failure”
- stable (no regression): when the two QA results are “success”
- regression: when the new QA result is “failure” while the reference one is “success”
- error: when one of the work requests providing the required QA results errored out

Details can also be provided as output of the analysis; they will typically be displayed in the summary view of the qa workflow.
The first level of comparison is at the level of the result
of the WorkRequest
, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections detail how those deeper comparisons are performed.
For lintian
We compare the summary.tags_count_by_severity
to determine the
status of the regression analysis:
SEVERITIES = ("warning", "error")
if any(new_count[s] > original_count[s] for s in SEVERITIES):
    return "regression"
elif any(new_count[s] < original_count[s] for s in SEVERITIES):
    return "improvement"
else:
    return "stable"
We also perform a comparison of the summary.tags_found
to indicate
in the details
field which new tags have been reported, and which tags
have disappeared.
Note
Among the differing tags, there can be tags with severities lower than warning and error, but we have no way to filter them out without loading the full analysis.json from the artifact, which would be much more costly for almost no gain.
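A sketch of that comparison, assuming summary.tags_found is available as a list of tag names on both sides:

def lintian_tag_details(original_summary, new_summary):
    original_tags = set(original_summary["tags_found"])
    new_tags = set(new_summary["tags_found"])
    return {
        "new tags": sorted(new_tags - original_tags),
        "disappeared tags": sorted(original_tags - new_tags),
    }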
For autopkgtest
We compare the result of each individual test in the results key of the artifact metadata. Each result is classified on its own following the table below; the first line that matches ends the classification process:
| ORIGINAL | NEW | RESULT |
|---|---|---|
| (any) | FLAKY | stable |
| PASS, SKIP | FAIL | regression |
| FAIL, FLAKY | PASS, SKIP | improvement |
| (any) | (any) | stable |
Each individual regression or improvement is noted and documented in the
details
field of the analysis.
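A sketch of the per-test classification implementing the table above (the result strings follow the table; the function name is illustrative):

def classify_test(original, new):
    # The first matching rule wins, as in the table above.
    if new == "FLAKY":
        return "stable"
    if original in ("PASS", "SKIP") and new == "FAIL":
        return "regression"
    if original in ("FAIL", "FLAKY") and new in ("PASS", "SKIP"):
        return "improvement"
    return "stable"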
To compute the global result of the regression analysis, the logic is the following:
if "regression" in comparison_of_tests:
return "regression"
elif "improvement" in comparison_of_tests:
return "improvement"
else:
return "stable"
For piuparts and blhc
The provided metadata do not allow for deep comparisons, so the comparison
is based on the result
of the corresponding WorkRequest (which is
duplicated in the per-item data of the debian:qa-results
collection).
The algorithm is the following:
if origin.result == SUCCESS and new.result == FAILURE:
    return "regression"
elif origin.result == FAILURE and new.result == SUCCESS:
    return "improvement"
else:
    return "stable"
About the UI to display regression analysis
Here’s an example of what the table could look like:
| Test name | Original result for dpkg_1.2.0 | New result for dpkg_1.2.1 | Conclusion |
|---|---|---|---|
| autopkgtest:dpkg_amd64 | ✅ | ❌ | ↘️ regression |
| lintian:dpkg_source | ✅ | ✅ | ➡️ stable |
| piuparts:dpkg_amd64 | ✅ | ❔ | ❔ no-result |
| autopkgtest:apt_amd64 | ❌ | ✅ | ↗️ improvement |
| Summary | 1 failure | 1 failure, 1 missing result | ↘️ regression |
Multiple comments about the desired table:
- We should use the standard WorkRequest result widgets instead of the special characters (✅ and ❌) shown above.
- We want to put links to the artifacts for each QA result in the “Original result” and “New result” columns.
- The number of autopkgtest results due to the reverse_dependencies_autopkgtest workflow can be overwhelming. Because of this, the autopkgtest lines that concern source packages other than the one processed in the current workflow are hidden if the regression analysis result is “stable” or “no-result”.
- For piuparts tasks, where we don’t have artifacts to link to, we probably want to link to the work request directly.