Add more selective provider tests (apache#24666)

After implementing apache#24610 and few follow-up fixes, it is now easy to add more optimizations to our unit test execution in CI (and to give this capability back to our contributors). This PR adds capability of running tests for selected set of providers - not for the whole "Providers" group. You can specify `--test-type "Providers[airbyte,http]" to only run tests for the two selected providers. This is the step towards separating providers to separate repositories, but it also allows to optimize the experience of the contributors developing only single provider changes (which is vast majority of contributions). This also allows to optimize build and elapsed time needd to run tests for those PRs that only affects selected providers (again - vast majority of PRs). The CI selection of which provider tests is done now in Selective Checkcs - they are a bit smarter in just selecting the providers that has been changed, they also check if there are any other providers that depend on it (we keep automatically updated by pre-commit dependencies.json file and this file determines which files should be run.
Jorricks · Jul 6, 2022 · 3dedbd3 · 3dedbd3
1 parent ff7e7a0
commit 3dedbd3
Show file tree

Hide file tree

Showing 10 changed files with 311 additions and 112 deletions.
diff --git a/BREEZE.rst b/BREEZE.rst
@@ -500,9 +500,6 @@ Airflow Breeze is a bash script serving as a "swiss-army-knife" of Airflow testi
 hood it uses other scripts that you can also run manually if you have problem with running the Breeze
 environment. Breeze script allows performing the following tasks:
 
-Development tasks
------------------
-
 Those are commands mostly used by contributors:
 
 * Execute arbitrary command in the test environment with ``breeze shell`` command
@@ -512,6 +509,7 @@ Those are commands mostly used by contributors:
 * Initialize local virtualenv with ``./scripts/tools/initialize_virtualenv.py`` command
 * Run static checks with autocomplete support ``breeze static-checks`` command
 * Run test specified with ``breeze tests`` command
+* Run docker-compose tests with ``breeze docker-compose-tests`` command.
 * Build CI docker image with ``breeze build-image`` command
 * Cleanup breeze with ``breeze cleanup`` command
 
@@ -524,8 +522,53 @@ Additional management tasks:
 Tests
 -----
 
-* Run docker-compose tests with ``breeze docker-compose-tests`` command.
-* Run test specified with ``breeze tests`` command.
+You can regular unit tests with ``breeze`` in two different ways, either interactively run tests with
+the default ``shell`` command or via the ``tests`` command.
+
+Iterate on tests interactively
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can simply enter the ``breeze`` container and run ``pytest`` command there. You can enter the
+container via just ``breeze`` command or ``breeze shell`` command (the latter has more options
+useful when you run integration or system tests). This is the best way if you want to interactively
+run selected tests and iterate with the tests. Once you enter ``breeze`` environment it is ready
+out-of-the-box to run your tests by running the right ``pytest`` command (autocomplete should help
+you with autocompleting test name if you start typing ``pytest tests<TAB>``).
+
+Here are few examples:
+
+Running single test:
+
+.. code-block:: bash
+
+    pytest tests/core/test_core.py::TestCore::test_check_operators
+
+To run the whole test class:
+
+.. code-block:: bash
+
+    pytest tests/core/test_core.py::TestCore
+
+You can re-run the tests interactively, add extra parameters to pytest and modify the files before
+re-running the test to iterate over the tests. You can also add more flags when starting the
+``breeze shell`` command when you run integration tests or system tests. Read more details about it
+in the ``TESTING.rst <TESTING.rst#>`` where all the test types of our are explained and more information
+on how to run them.
+
+Running group of tests
+~~~~~~~~~~~~~~~~~~~~~~
+
+You can also run tests via built-in ``breeze tests`` command - similarly as iterative ``pytest`` command
+allows to run test individually, or by class or in any other way pytest allows to test them, but it
+also allows to run the tests in the same test "types" that are used to run the tests in CI: Core, Always
+API, Providers. This how our CI runs them - running each group in parallel to other groups and you can
+replicate this behaviour.
+
+Another interesting use of the ``breeze tests`` command is that you can easily specify sub-set of the
+tests for Providers. ``breeze tests --test-type "Providers[airbyte,http]`` for example will only run
+tests for airbyte and http providers.
+
+Here is the detailed set of options for the ``breeze tests`` command.
 
 .. image:: ./images/breeze/output-tests.svg
   :width: 100%

diff --git a/Dockerfile.ci b/Dockerfile.ci
@@ -952,13 +952,18 @@ else
             ${TEST_TYPE} == "Long" || \
             ${TEST_TYPE} == "Integration" ]]; then
         SELECTED_TESTS=("${ALL_TESTS[@]}")
+    elif [[ ${TEST_TYPE} =~ Providers\[(.*)\] ]]; then
+        SELECTED_TESTS=()
+        for provider in ${BASH_REMATCH[1]//,/ }
+        do
+            SELECTED_TESTS+=("tests/providers/${provider//./\/}")
+        done
     else
         echo
         echo  "${COLOR_RED}ERROR: Wrong test type ${TEST_TYPE}  ${COLOR_RESET}"
         echo
         exit 1
     fi
-
 fi
 readonly SELECTED_TESTS CLI_TESTS API_TESTS PROVIDERS_TESTS CORE_TESTS WWW_TESTS \
     ALL_TESTS ALL_PRESELECTED_TESTS

diff --git a/TESTING.rst b/TESTING.rst
@@ -109,7 +109,8 @@ To run unit tests from the Visual Studio Code:
     :alt: Running tests
 
 Running Unit Tests
---------------------------------
+------------------
+
 To run unit, integration, and system tests from the Breeze and your
 virtualenv, you can use the `pytest <http://doc.pytest.org/en/latest/>`_ framework.
 
@@ -188,6 +189,18 @@ You can also limit the tests to execute to specific group of tests
 
     breeze tests --test-type Core
 
+In case of Providers tests, you can run tests for all providers
+
+.. code-block:: bash
+
+    breeze tests --test-type Providers
+
+You can also limit the set of providers you would like to run tests of
+
+.. code-block:: bash
+
+    breeze tests --test-type "Providers[airbyte,http]"
+
 
 You can also write tests in "limited progress" mode (useful in the future to run CI). In this mode each
 test just prints "percentage" summary of the run as single line and only dumps full output of the test

diff --git a/dev/breeze/SELECTIVE_CHECKS.md b/dev/breeze/SELECTIVE_CHECKS.md
@@ -87,6 +87,14 @@ The logic implements the following rules:
 * If no Source files are changed - no tests are run and no further rules below are checked.
 * `Image building` is enabled if either test are run, docs are build or kubernetes tests are run. All those
   need `CI` or `PROD` images to be built.
+* In case of `Providers` test in regular PRs, additional check is done in order to determine which
+  providers are affected and the actual selection is made based on that:
+  * if directly provider code is changed (either in the provider, test or system tests) then this provider
+    is selected.
+  * if there are any providers that depend on the affected providers, they are also included in the list
+    of affected providers (but not recursively - only direct dependencies are added)
+  * if there are any changes to "common" provider code not belonging to any provider (usually system tests
+    or tests), then tests for all Providers are run
 * The specific unit test type is enabled only if changed files match the expected patterns for each type
   (`API`, `CLI`, `WWW`, `Providers`). The `Always` test type is added always if any unit tests are run.
   `Providers` tests are removed if current branch is different than `main`

diff --git a/dev/breeze/src/airflow_breeze/commands/testing_commands.py b/dev/breeze/src/airflow_breeze/commands/testing_commands.py
@@ -48,7 +48,7 @@
     option_verbose,
 )
 from airflow_breeze.utils.console import get_console, message_type_from_return_code
-from airflow_breeze.utils.custom_param_types import BetterChoice
+from airflow_breeze.utils.custom_param_types import NotVerifiedBetterChoice
 from airflow_breeze.utils.docker_command_utils import (
     get_env_variables_for_docker_commands,
     perform_environment_checks,
@@ -247,9 +247,10 @@ def run_with_progress(
 @option_mount_sources
 @click.option(
     "--test-type",
-    help="Type of test to run.",
+    help="Type of test to run. Note that with Providers, you can also specify which provider "
+    "tests should be run - for example --test-type \"Providers[airbyte,http]\"",
     default="All",
-    type=BetterChoice(ALLOWED_TEST_TYPE_CHOICES),
+    type=NotVerifiedBetterChoice(ALLOWED_TEST_TYPE_CHOICES),
 )
 @option_db_reset
 @click.argument('extra_pytest_args', nargs=-1, type=click.UNPROCESSED)
@@ -272,6 +273,9 @@ def tests(
     os.environ["RUN_TESTS"] = "true"
     if test_type:
         os.environ["TEST_TYPE"] = test_type
+        if "[" in test_type and not test_type.startswith("Providers"):
+            get_console().print("[error]Only 'Providers' test type can specify actual tests with \\[\\][/]")
+            sys.exit(1)
     if integration:
         if "trino" in integration:
             integration = integration + ("kerberos",)

diff --git a/dev/breeze/src/airflow_breeze/utils/selective_checks.py b/dev/breeze/src/airflow_breeze/utils/selective_checks.py
@@ -17,12 +17,19 @@
 
 from __future__ import annotations
 
+import json
+import os
 import sys
 from enum import Enum
 
+from rich.markup import escape
+
+from airflow_breeze.utils.path_utils import AIRFLOW_SOURCES_ROOT
+
 if sys.version_info >= (3, 8):
     from functools import cached_property
 else:
+    # noinspection PyUnresolvedReferences
     from cached_property import cached_property
 
 from functools import lru_cache
@@ -59,7 +66,7 @@
 def get_ga_output(name: str, value: Any) -> str:
     output_name = name.replace('_', '-')
     printed_value = str(value).lower() if isinstance(value, bool) else value
-    get_console().print(f"[info]{output_name}[/] = [green]{printed_value}[/]")
+    get_console().print(f"[info]{output_name}[/] = [green]{escape(str(printed_value))}[/]")
     return f"::set-output name={output_name}::{printed_value}"
 
 
@@ -127,6 +134,7 @@ def __hash__(self):
             r"^airflow/.*\.py$",
             r"^chart",
             r"^providers",
+            r"^tests/system",
             r"^CHANGELOG\.txt",
             r"^airflow/config_templates/config\.yml",
             r"^chart/RELEASE_NOTES\.txt",
@@ -148,6 +156,7 @@ def __hash__(self):
             r"^kubernetes_tests",
             r"^airflow/providers/cncf/kubernetes/",
             r"^tests/providers/cncf/kubernetes/",
+            r"^tests/system/providers/cncf/kubernetes/",
         ],
         FileGroupForCi.ALL_PYTHON_FILES: [
             r"\.py$",
@@ -178,11 +187,61 @@ def __hash__(self):
         SelectiveUnitTestTypes.PROVIDERS: [
             "^airflow/providers/",
             "^tests/providers/",
+            "^tests/system/",
         ],
         SelectiveUnitTestTypes.WWW: ["^airflow/www", "^tests/www", "^airflow/ui"],
     }
 )
 
+TESTS_PROVIDERS_ROOT = AIRFLOW_SOURCES_ROOT / "tests" / "providers"
+SYSTEM_TESTS_PROVIDERS_ROOT = AIRFLOW_SOURCES_ROOT / "tests" / "system" / "providers"
+AIRFLOW_PROVIDERS_ROOT = AIRFLOW_SOURCES_ROOT / "airflow" / "providers"
+
+
+def find_provider_affected(changed_file: str) -> str | None:
+    file_path = AIRFLOW_SOURCES_ROOT / changed_file
+    # is_relative_to is only available in Python 3.9 - we should simplify this check when we are Python 3.9+
+    for provider_root in (TESTS_PROVIDERS_ROOT, SYSTEM_TESTS_PROVIDERS_ROOT, AIRFLOW_PROVIDERS_ROOT):
+        try:
+            file_path.relative_to(provider_root)
+            relative_base_path = provider_root
+            break
+        except ValueError:
+            pass
+    else:
+        return None
+
+    for parent_dir_path in file_path.parents:
+        if parent_dir_path == relative_base_path:
+            break
+        relative_path = parent_dir_path.relative_to(relative_base_path)
+        if (AIRFLOW_PROVIDERS_ROOT / relative_path / "provider.yaml").exists():
+            return str(parent_dir_path.relative_to(relative_base_path)).replace(os.sep, ".")
+    # If we got here it means that some "common" files were modified. so we need to test all Providers
+    return "Providers"
+
+
+def add_dependent_providers(
+    providers: set[str], provider_to_check: str, dependencies: dict[str, dict[str, list[str]]]
+):
+    for provider, provider_info in dependencies.items():
+        if provider_to_check in provider_info['cross-providers-deps']:
+            providers.add(provider)
+
+
+def find_all_providers_affected(changed_files: tuple[str, ...]) -> set[str]:
+    all_providers: set[str] = set()
+    for changed_file in changed_files:
+        provider = find_provider_affected(changed_file)
+        if provider == "Providers":
+            return set()
+        if provider is not None:
+            all_providers.add(provider)
+    dependencies = json.loads((AIRFLOW_SOURCES_ROOT / "generated" / "provider_dependencies.json").read_text())
+    for provider in list(all_providers):
+        add_dependent_providers(all_providers, provider, dependencies)
+    return all_providers
+
 
 class SelectiveChecks:
     __HASHABLE_FIELDS = {'_files', '_default_branch', '_commit_ref', "_pr_labels", "_github_event"}
@@ -446,6 +505,11 @@ def _get_test_types_to_run(self) -> list[str]:
             get_console().print(remaining_files)
             candidate_test_types.update(all_selective_test_types())
         else:
+            if "Providers" in candidate_test_types:
+                affected_providers = find_all_providers_affected(changed_files=self._files)
+                if len(affected_providers) != 0:
+                    candidate_test_types.remove("Providers")
+                    candidate_test_types.add(f"Providers[{','.join(sorted(affected_providers))}]")
             get_console().print(
                 "[warning]There are no core/other files. Only tests relevant to the changed files are run.[/]"
             )
@@ -459,22 +523,25 @@ def test_types(self) -> str:
         if not self.run_tests:
             return ""
         if self._run_everything:
-            current_test_types = list(all_selective_test_types())
+            current_test_types = set(all_selective_test_types())
         else:
-            current_test_types = self._get_test_types_to_run()
+            current_test_types = set(self._get_test_types_to_run())
         if self._default_branch != "main":
-            if "Providers" in current_test_types:
-                get_console().print(
-                    "[warning]Removing 'Providers' because the target branch "
-                    f"is {self._default_branch} and not main[/]"
-                )
-                current_test_types.remove("Providers")
+            test_types_to_remove: set[str] = set()
+            for test_type in current_test_types:
+                if test_type.startswith("Providers"):
+                    get_console().print(
+                        f"[warning]Removing {test_type} because the target branch "
+                        f"is {self._default_branch} and not main[/]"
+                    )
+                    test_types_to_remove.add(test_type)
             if "Integration" in current_test_types:
                 get_console().print(
                     "[warning]Removing 'Integration' because the target branch "
                     f"is {self._default_branch} and not main[/]"
                 )
-                current_test_types.remove("Integration")
+                test_types_to_remove.add("Integration")
+            current_test_types = current_test_types - test_types_to_remove
         return " ".join(sorted(current_test_types))
 
     @cached_property