=====================
Detecting regressions
=====================

Goals and requirements
======================

When preparing updates, it is part of best practices to ensure that
the update does not introduce any regression on all the QA checks.
The concept of regression is thus to be managed at the level of the
QA workflow.

Some regressions can be identified at the level of the result of the
respective QA tasks. An ``autopkgtest`` that used to pass, and now fails,
is an obvious example. However, when both are failing, you might want
to dig deeper and see if the same set of tests are failing. If we can
identify new tests that are failing, we still have a regression.

On the opposite, a lintian analysis might indicate a success on both
versions, but generate new warning-level tags, that we might want to report
as a regression.

Other considerations:

* We want the QA workflow to have a summary view showing side-by-side the
  result of the "QA tasks" on the original and updated package(s) along
  with a conclusion on whether it's a regression or not. It should be
  possible to have detailed view of a specific comparison, showing
  a comparison of the artifacts generated by the QA tasks.
* We want that table to be regularly updated, every time that a QA task
  finishes, without waiting for the completion of all QA tasks.
* We want to be able to configure the ``debian_pipeline`` workflow
  so that any failure or any regression in the ``qa`` workflow requires a
  manual confirmation to continue the root workflow. (Right now the result of
  the ``qa`` workflow has no impact on the ``package_upload`` workflow for
  example)

.. collection:: debian:qa-results

About the ``debian:qa-results`` collection category
===================================================

This collection can host all the QA results for all the packages in a
given suite. We thus have different kind of items for each QA task.

Common features across all QA results
-------------------------------------

The name of the collection item is always
``{task_name}_{package}_{version}_{architecture}_{work_request_id}``
substituting values from the per-item data described below.

* Variables when adding items: see "Per-item data" below

* Data:

  * ``suite_collection`` (:ref:`lookup-single`, required): the
    :collection:`debian:suite` collection for which the collection is
    storing the QA results.
  * ``old_items_to_keep``: number of old items to keep. Defaults to 5.
    For each task/package/architecture combination, the collection always
    keeps the given number of most recent entries (based on the
    ``timestamp`` field). The cleanup is automatically done when adding
    new items.

    .. note::

        At some point, we may need more advanced logic than this, for
        instance to clean up QA results about packages that are gone
        from the corresponding suite.

* Per-item data:

  * ``task_name`` (inferred from work request when adding item): the name of
    the task that generated the QA result (so ``autopkgtest``, ``lintian``,
    ``piuparts``, in the future also ``blhc``)
  * ``package``: the name of the source package being analyzed, or the
    source package from which the binary package being analyzed was built
  * ``version``: the version of the source package being analyzed, or the
    source package from which the binary package being analyzed was built
  * ``architecture``: ``source`` for a source analysis, or the appropriate
    architecture name for a binary analysis
  * ``timestamp``: a Unix timestamp (cf. ``date +%s``) used to estimate
    the freshness of the QA result, can often be usefully tied to the
    ``Date`` field of the package repository in which the analysis was
    performed.
  * ``work_request_id``: the ID of the WorkRequest that generated
    the QA result
  * ``result`` (inferred from work request when adding item): duplicates the
    string value of the result field of the associated WorkRequest

* Lookup names:

  * ``latest:TASKNAME_PACKAGE_ARCHITECTURE``: the latest QA result for tasks
    named ``TASKNAME`` for the source package named ``PACKAGE`` on
    ``ARCHITECTURE``.

Specification for ``lintian``
-----------------------------

The collection item is a :artifact:`debian:lintian` artifact. The collection
contains items for the source and for all architectures.

A lintian analysis is outdated if:

* either the underlying source or binary packages are outdated (i.e. have
  different version numbers) compared to what's available in the
  :collection:`debian:suite` collection
* or the lintian version used to perform the analysis is older than the
  version available in the :collection:`debian:suite` collection

Specification for ``autopkgtest``
---------------------------------

The collection item is a :artifact:`debian:autopkgtest` artifact. The
collection contains items for all architectures (but not for the source).

An autopkgtest analysis is outdated if:

* either the underlying source or binary packages are outdated (i.e. have
  different version numbers) compared to what's available in the
  :collection:`debian:suite` collection
* or the timestamp of the analysis is older than 30 days compared
  to the ``Date`` timestamp of the :collection:`debian:suite` collection

Specification for ``piuparts``
------------------------------

The collection item is a bare data item of category ``debian:qa-result``
with all the common per-item data described above. The collection contains
items for all architectures (but not for the source).

A piuparts analysis is outdated if the underlying binary packages are
outdated (i.e. have different version numbers) compared to what's available
in the :collection:`debian:suite` collection.

.. note::

   The lack of piuparts artifact means that we don't have any information
   about the binary packages that were analyzed except if we lookup the
   details of the WorkRequest. That's probably going too far so instead
   we will likely compare based on the source version documented in the
   per-item data.

   Filed :issue:`805` to think about introducing a proper artifact at some
   point.

Specification for ``blhc``
--------------------------

The collection item is a :artifact:`debian:blhc` artifact. The collection
contains items for all architectures (but not for the source).

A blhc analysis is outdated if the underlying source package is
outdated (i.e. has a smaller version number) compared to what's available
in the :collection:`debian:suite` collection. The comparison needs to be
performed based on the metadata of the linked
:artifact:`debian:package-build-log` artifact.

Specification for ``debdiff``
-----------------------------

The :task:`DebDiff` QA task does not contribute any item to the
``debian:qa-results`` because it does not provide any validation
of a single target artifact.

By its nature, the task already performs a comparison between
the original version and the new version. And the result of the comparison
can't easily be used to draw any conclusion about the update, it
is up to human reviewers to decide of that.

Implementation plan
===================

Changes to the ``debian_pipeline`` workflow
-------------------------------------------

The workflow is expanded with new parameters:

* ``enable_regression_tracking`` (boolean, defaults to False): configure the QA
  workflow to detect and display regressions in QA results.
* ``regression_tracking_qa_results`` (:ref:`lookup-single`, required if
  ``enable_regression_tracking`` is True): the
  :collection:`debian:qa-results` collection that contains the reference
  results of QA tasks to use to detect regressions.
* ``qa_failure_policy`` (string, default value ``ignore``). The policy to apply
  when the ``qa`` workflow failed. Allowed values are ``ignore``,
  ``fail``, ``confirm``.

The parameter ``reverse_dependencies_autopkgtest_suite`` is renamed
into ``qa_suite`` and becomes required if either
``enable_regression_tracking`` or
``enable_reverse_dependencies_autopkgtest`` is True. It refers to
a :collection:`debian:suite` collection which can be used to schedule
reverse dependencies autopkgtest and/or generate the reference QA results
that are needed to detect regressions.

When ``enable_regression_tracking`` is set
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* the normal ``qa`` workflow is run with:

  * ``enable_regression_tracking`` set to True
  * ``reference_qa_results`` set with the value of
    ``regression_tracking_qa_results``

* an extra ``qa`` workflow is run with:

  * ``enable_regression_tracking`` set to False
  * ``reference_qa_results`` set with the value of
    ``regression_tracking_qa_results``
  * ``update_qa_results`` set to True
  * ``prefix`` set to ``reference-qa-result|``
  * ``source_artifact`` and ``binary_artifacts`` pointing to the
    corresponding artifacts in the :collection:`debian:suite` collection
    listed by ``qa_suite``
  * ``fail_on`` set to ``never``

Handling of ``qa_failure_policy``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current behaviour is ``ignore``. The workflow is scheduled but nothing
depends on it.

With the ``fail`` policy, the remaining workflows (or at least the next
workflow in the dependency chain) gain dependencies on the ``qa`` workflow
so that they are not executed before completion of the ``qa`` workflow
and are effectively aborted if anything fails.

With the ``confirm`` policy, the ``qa`` workflow is scheduled with
``workflow_data.allow_failure: true`` and a new :task:`Confirm WAIT task
<Confirm>` is scheduled in between the ``qa`` workflow and the remaining
parts of the workflow. That confirm task has the following task data:

* ``auto_confirm_if_no_failure: true``
* ``confirm_label: "Continue the workflow despite QA failures"``

Changes to the ``qa`` workflow
------------------------------

The workflow is modified to also accept multiple
:artifact:`debian:binary-package` artifacts as input in
``binary_artifacts``. This will require ensuring that the sub-workflows are
able to accept a similar input.

The workflow is expanded with new parameters:

* ``enable_regression_tracking`` (defaults to False): configure the QA
  workflow to detect and display regressions in QA results.
* ``reference_qa_results`` (:ref:`lookup-single`, optional): the
  collection of category :collection:`debian:qa-results` that contains the
  reference results of QA tasks to use to detect regressions.
* ``update_qa_results`` (boolean, defaults to False): when set to True,
  the workflow runs QA tasks and updates the collection passed in
  ``reference_qa_results`` with the results.
* ``prefix`` (string, optional): prefix this string to the item names
  provided in the internal collection
* ``fail_on`` (string, optional): indicate the conditions to trigger a
  failure of the whole workflow. Allowed values are ``failure``,
  ``regression``, ``never``. With ``failure``, the workflow is marked as
  failed if one of the QA task fails. With ``regression``, the workflow
  fails only if one of the QA result is a regression compared to the
  former result. With ``never``, the workflow always succeeds. The default
  value is ``regression`` if ``enable_regression_tracking`` is True,
  otherwise it is ``failure``.

Behavior with ``update_qa_results`` set to True
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``update_qa_results`` is set to True, the goal of the workflow
is modified: its only purpose is to provide reference results to
be stored in a :collection:`debian:qa-results` collection. Task failures are
never fatal for the parent workflow or for dependent tasks.

During orchestration, the workflow compares the data available in the
:collection:`debian:qa-results` collection together with information about
the submitted ``source_artifact`` and ``binary_artifacts``.

When a missing or outdated QA result is detected, it schedules the
appropriate QA task, and it creates a corresponding promise in the internal
collection (the name of the promise is the prefix followed by the expected
name of the collection entry).  The QA task has the following event
reactions:

* ``on_assignment``: an action to :ref:`skip the work request
  <action-skip-if-lookup-result-changed>` if the latest relevant item in the
  :collection:`debian:qa-results` collection has changed since the work
  request has created; this avoids wasting resources if multiple parallel
  workflows trigger an update of the same QA results
* ``on_success``: an action to add the result to the
  :collection:`debian:qa-results` collection
* ``on_failure``: same as ``on_success``

Note that when ``enable_reverse_dependencies_autopkgtest`` is set to True,
it must also update the autopkgtest results of the reverse dependencies
and thus compute the same list of packages as the
``reverse_dependencies_autopkgtest`` workflow (using the same
``qa_suite`` collection).

Behavior with ``enable_regression_tracking`` set to True
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``enable_regression_tracking`` is set to True, the orchestrator
of the ``qa`` workflow schedules :ref:`workflow callbacks
<workflow-callback>` that will perform the regression analysis. In order
to wait for the availability of the QA result(s), those callbacks have
dependencies against:

* the promises associated to the QA result(s) that are required from the
  additional ``qa`` workflow building reference results
* the promises associated to the QA result(s) that are required from the
  sub-workflows

The ``workflow_data`` field for those workflow callbacks have:

* ``visible`` set to False so that they do not show up in the workflow
  hierarchy (new feature to implement)
* ``step`` set to ``regression-analysis``

As part of the callback, the analysis is performed and the result
of the analysis is stored in the ``output_data`` field of the workflow.

.. note::

   We use simple workflow callbacks instead of full-fledged worker
   tasks or server tasks because we assume that regression analysis
   can be completed just by comparing the artifact metadata and/or
   the collection item. Workflow callbacks are already dealt through
   celery tasks so they are relatively cheap. Note however that the
   large number of callbacks requires use of careful locking to
   serialize the operations between concurrent runs trying to update
   the same workflow.

Handling of ``fail_on``
~~~~~~~~~~~~~~~~~~~~~~~

With ``fail_on: never`` or ``fail_on: regression``, all the sub-workflows
are run with ``workflow_data.allow_failure: true``.

With ``fail_on: regression``, a final orchestrator callback is scheduled:

* it depends on all the ``regression-analysis`` callbacks
* ``workflow_data.visible`` is set to True
* ``workflow_data.step`` is ``final-regression-analysis``
* ``workflow_data.display_name`` is ``Regression analysis``

The callback reviews the data in ``output_data.regression_analysis``
and sets its own result to FAILURE in case of regression, or SUCCESS
otherwise.

.. note::

   The assumption is that the result of this callback will bubble
   up to the ``qa`` workflow since the callback itself should
   be configured with ``workflow_data.allow_failure: false``.

   If that assumption does not hold true, then we might want to simply
   set the result of the workflow since we know (through dependencies)
   that we are the last work request in the workflow so we should
   be able to just mark it as completed too!

Various removals
----------------

The :collection:`debian:suite-lintian` collection is removed since the
:collection:`debian:qa-results` collection is effectively a superset of that
collection.

Therefore the :task:`UpdateDerivedCollection` and
:task:`UpdateSuiteLintianCollection` server tasks are removed too.

Implementation of QA results regression analysis
================================================

The ``output_data`` of a qa workflow has a new ``regression_analysis``
key which is a dictionary of such analysis. The key represents the
name of a test (e.g. ``autopkgtest:dpkg:amd64``) without any version
and the value is the result of the analysis which is defined as another
dictionary with the following keys:

* ``original_url`` (optional, can be set later when the QA result is
  available): URL pointing to the original artifact or bare-data collection
  item used for the comparison
* ``new_url`` (optional, can be set later when the QA result is available):
  URL pointing to the new artifact or bare-data collection item used
  for the comparison
* ``status`` (required): a string value among the following values:

  * ``no-result``: when the comparison has not been completed yet (usually
    because we lack one of the two required QA results)
  * ``error``: when the comparison (or one of the required QA tasks) errored out
  * ``improvement``: when the new QA result is better than the original
    QA result
  * ``stable``: when the new QA result is neither better nor worse than
    the original QA result
  * ``regression``: when the new QA result is worse than the original QA
    result

* ``details`` (optional): an arbitrarily nested data-structure composed of
  lists and dictionaries where dictionary keys and leaf items (and/or leaf
  item values) are always strings. Expectation is that this structure is
  rendered as nested lists shown behind a collapsible section that can be
  unfolded to learn more about the analysis. The strings are HTML-escaped
  when rendered.

The regression analysis can lead to multiple results:

* no-result: when we are lacking one of the QA results for the comparison
* improvement: when the new QA result is "success" while the reference one
  is "failure"
* no-regression: when the two QA results are "success"
* regression: when the new QA result is "failure" while the reference one
  is "success"
* error: when one of the work requests providing the required QA result
  errored out

Details can also be provided as output of the analysis, they will
typically be displayed in the summary view of the qa workflow.

The first level of comparison is at the level of the ``result``
of the ``WorkRequest``, following the logic above. But depending on the
output of the QA task, it is possible to have a finer-grained analysis.
The next sections details how those deeper comparisons are performed.

For lintian
-----------

We compare the ``summary.tags_count_by_severity`` to determine the
status of the regression analysis::

    SEVERITIES = ("warning", "error")
    if any(new_count[s] > original_count[s] for s in SEVERITIES):
        return "regression"
    elif any(new_count[s] < original_count[s] for s in SEVERITIES):
        return "improvement"
    else
        return "stable"

We also perform a comparison of the ``summary.tags_found`` to indicate
in the ``details`` field which new tags have been reported, and which tags
have disappeared.

.. note::

   Among the difference of tags, there can be tags that have severities
   lower than warning and error, but we have no way to filter them out
   without loading the full analysis.json from the artifact which would be
   much more costly for almost no gain.


For autopkgtest
---------------

We compare the result of each individual test in the ``results`` key
of the artifact metadata. Each result is classified on its own following
the table below, the first line that matches ends the classification
process:

.. list-table::
   :header-rows: 1

   * - ORIGINAL
     - NEW
     - RESULT
   * - ``*``
     - FLAKY
     - stable
   * - PASS, SKIP
     - FAIL
     - regression
   * - FAIL, FLAKY
     - PASS, SKIP
     - improvement
   * - ``*``
     - ``*``
     - stable

Each individual regression or improvement is noted and documented in the
``details`` field of the analysis.

To compute the global result of the regression analysis, the logic is the
following::

    if "regression" in comparison_of_tests:
        return "regression"
    elif "improvement" in comparison_of_tests:
        return "improvement"
    else:
        return "stable"

For piuparts and blhc
---------------------

The provided metadata do not allow for deep comparisons, so the comparison
is based on the ``result`` of the corresponding WorkRequest (which is
duplicated in the per-item data of the :collection:`debian:qa-results`
collection).

The algorithm is the following::

    if origin.result == SUCCESS and new.result == FAILURE:
        return "regression"
    elif origin.result == FAILURE and new.result == SUCCESS:
        return "improvement"
    else
        return "stable"

About the UI to display regression analysis
===========================================

Here's an example of what the table could look like:

.. list-table::
   :header-rows: 1

   * - Test name
     - Original result for dpkg_1.2.0
     - New result for dpkg_1.2.1
     - Conclusion
   * - autopkgtest:dpkg_amd64
     - ✅
     - ❌
     - ↘️  regression
   * - lintian:dpkg_source
     - ✅
     - ✅
     - ➡️  stable
   * - piuparts:dpkg_amd64
     - ✅
     - ❔
     - ❔ no-result
   * - autopkgtest:apt_amd64
     - ❌
     - ✅
     - ↗️  improvement
   * - **Summary**
     - 1 failure
     - 1 failure, 1 missing result
     - ↘️  regression

Multiple comments about the desired table:

* We should use the standard WorkRequest result widgets instead of the special
  characters (✅ and ❌) shown above.
* We want to put links to the artifact for each QA result in the "Original
  result" and "New result" columns.
* The number of autopkgtest results due to the reverse_dependencies_autopkgtest
  workflow can be overwhelming. Due to this, the autopkgtest lines that
  concern other source packages than the one processed in the current
  workflow are hidden if the regression analysis result is "stable" or
  "no-result".
* For piuparts tasks where we don't have artifacts to link, we probably
  want to link to the work request directly.