Detecting regressions
Goals and requirements
When preparing updates, it is best practice to ensure that the update does not introduce any regression in the QA checks. The concept of regression is thus managed at the level of the QA workflow.
Some regressions can be identified at the level of the result of the
respective QA tasks. An autopkgtest that used to pass, and now fails,
is an obvious example. However, when both versions fail, you might want
to dig deeper and check whether the same set of tests is failing. If we can
identify new tests that are failing, we still have a regression.
Conversely, a lintian analysis might succeed on both versions but generate new warning-level tags, which we might want to report as a regression.
Other considerations:
- We want the QA workflow to have a summary view showing side by side the results of the “QA tasks” on the original and updated package(s), along with a conclusion on whether it’s a regression or not. It should be possible to have a detailed view of a specific comparison, showing a comparison of the artifacts generated by the QA tasks.
- We want that table to be updated regularly, every time a QA task finishes, without waiting for the completion of all QA tasks.
- We want to be able to configure the debian_pipeline workflow so that any failure or any regression in the qa workflow requires a manual confirmation to continue the root workflow. (Right now the result of the qa workflow has no impact on the package_upload workflow, for example.)
Implementation of QA results regression analysis
The output_data of a qa workflow has a new regression_analysis
key, which is a dictionary of such analyses. Each key is the
name of a test (e.g. autopkgtest:dpkg:amd64) without any version,
and the associated value is the result of the analysis, defined as another
dictionary with the following keys:
- original_url (optional, can be set later when the QA result is available): URL pointing to the original artifact or bare-data collection item used for the comparison
- new_url (optional, can be set later when the QA result is available): URL pointing to the new artifact or bare-data collection item used for the comparison
- status (required): a string value among the following values:
  - no-result: when the comparison has not been completed yet (usually because we lack one of the two required QA results)
  - error: when the comparison (or one of the required QA tasks) errored out
  - improvement: when the new QA result is better than the original QA result
  - stable: when the new QA result is neither better nor worse than the original QA result
  - regression: when the new QA result is worse than the original QA result
- details (optional): an arbitrarily nested data-structure composed of lists and dictionaries where dictionary keys and leaf items (and/or leaf item values) are always strings. Expectation is that this structure is rendered as nested lists shown behind a collapsible section that can be unfolded to learn more about the analysis. The strings are HTML-escaped when rendered.
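As a purely illustrative example, an entry in regression_analysis could look like the following sketch (the test name, URLs and details values below are made up):

regression_analysis = {
    "autopkgtest:dpkg:amd64": {
        # Both QA results are available here, so both URLs are set.
        "original_url": "https://debusine.example.org/artifact/1234/",
        "new_url": "https://debusine.example.org/artifact/1235/",
        "status": "regression",
        "details": {
            "new failing tests": ["upstream-tests"],
        },
    },
}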
The regression analysis can lead to multiple results:
- no-result: when we are lacking one of the QA results for the comparison
- improvement: when the new QA result is “success” while the reference one is “failure”
- stable: when the two QA results are “success”
- regression: when the new QA result is “failure” while the reference one is “success”
- error: when one of the work requests providing the required QA result errored out
Details can also be provided as output of the analysis; they will typically be displayed in the summary view of the qa workflow.
The first level of comparison is at the level of the result
of the WorkRequest, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections detail how those deeper comparisons are performed.
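As a rough sketch of that first-level comparison (purely illustrative: it assumes the WorkRequest results are available as plain strings and that a missing result is represented by None):

def first_level_status(original_result, new_result):
    # Compare only the WorkRequest results; the per-task analyses
    # described in the next sections can refine this classification.
    if original_result is None or new_result is None:
        return "no-result"
    if "error" in (original_result, new_result):
        return "error"
    if original_result == "success" and new_result == "failure":
        return "regression"
    if original_result == "failure" and new_result == "success":
        return "improvement"
    return "stable"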
For lintian
We compare the summary.tags_count_by_severity to determine the
status of the regression analysis:
SEVERITIES = ("warning", "error")

def lintian_status(original_count, new_count):
    # Compare the per-severity tag counts from summary.tags_count_by_severity.
    if any(new_count[s] > original_count[s] for s in SEVERITIES):
        return "regression"
    elif any(new_count[s] < original_count[s] for s in SEVERITIES):
        return "improvement"
    else:
        return "stable"
We also perform a comparison of the summary.tags_found to indicate
in the details field which new tags have been reported, and which tags
have disappeared.
Note
Among the differing tags, there can be tags with severities lower than warning and error, but we have no way to filter them out without loading the full analysis.json from the artifact, which would be much more costly for almost no gain.
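As an illustration, the comparison of summary.tags_found feeding the details field could be sketched as follows (the helper name is hypothetical and tags_found is simplified here to a flat list of tag names):

def lintian_tags_details(original_tags, new_tags):
    # Compute which tags appeared and which tags disappeared between
    # the original and new lintian summaries.
    original_set = set(original_tags)
    new_set = set(new_tags)
    return {
        "new tags": sorted(new_set - original_set),
        "disappeared tags": sorted(original_set - new_set),
    }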
For autopkgtest
We compare the result of each individual test in the results key
of the artifact metadata. Each result is classified on its own following
the table below; the first line that matches ends the classification
process (an empty cell matches any value):
| ORIGINAL | NEW | RESULT |
|---|---|---|
| | FLAKY | stable |
| PASS, SKIP | FAIL | regression |
| FAIL, FLAKY | PASS, SKIP | improvement |
| | | stable |
Each individual regression or improvement is noted and documented in the
details field of the analysis.
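A per-test classification implementing the table above could be sketched as follows (test outcomes are assumed to be plain strings such as "PASS", "FAIL", "SKIP" or "FLAKY"; the helper name is hypothetical):

def classify_test(original, new):
    # Apply the rows of the table above in order; the first matching
    # row determines the classification.
    if new == "FLAKY":
        return "stable"
    if original in ("PASS", "SKIP") and new == "FAIL":
        return "regression"
    if original in ("FAIL", "FLAKY") and new in ("PASS", "SKIP"):
        return "improvement"
    return "stable"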
To compute the global result of the regression analysis, the logic is the following:
if "regression" in comparison_of_tests:
return "regression"
elif "improvement" in comparison_of_tests:
return "improvement"
else:
return "stable"
For piuparts and blhc
The provided metadata do not allow for deep comparisons, so the comparison
is based on the result of the corresponding WorkRequest (which is
duplicated in the per-item data of the debian:qa-results
collection).
The algorithm is the following:
def piuparts_blhc_status(origin, new):
    # Compare the results of the original and new WorkRequests.
    if origin.result == SUCCESS and new.result == FAILURE:
        return "regression"
    elif origin.result == FAILURE and new.result == SUCCESS:
        return "improvement"
    else:
        return "stable"
About the UI to display regression analysis
Here’s an example of what the table could look like:
| Test name | Original result for dpkg_1.2.0 | New result for dpkg_1.2.1 | Conclusion |
|---|---|---|---|
| autopkgtest:dpkg_amd64 | ✅ | ❌ | ↘️ regression |
| lintian:dpkg_source | ✅ | ✅ | ➡️ stable |
| piuparts:dpkg_amd64 | ✅ | ❔ | ❔ no-result |
| autopkgtest:apt_amd64 | ❌ | ✅ | ↗️ improvement |
| Summary | 1 failure | 1 failure, 1 missing result | ↘️ regression |
Some comments about the desired table:

- We should use the standard WorkRequest result widgets instead of the special characters (✅ and ❌) shown above.
- We want to put links to the artifact for each QA result in the “Original result” and “New result” columns.
- The number of autopkgtest results generated by the reverse_dependencies_autopkgtest workflow can be overwhelming. For this reason, the autopkgtest lines that concern source packages other than the one processed in the current workflow are hidden when the regression analysis result is “stable” or “no-result”.
- For piuparts tasks, where we don’t have artifacts to link, we probably want to link to the work request directly.