Skip to content

Commit

Permalink
[CI] Make bazel sharding for parallel buildkite more intelligent (ray…
Browse files Browse the repository at this point in the history
…-project#29221)

This PR implements two changes to our `bazel-sharding.py` script, used for determining which bazel tests to run on each instance when buildkite parallelism is used:
* An ability to filter tests before they are sharded, using the same logic as `bazel test`. This is done by specifying the `--tag_filters` argument, eg. `--tag_filters=air,-gpu`. If we filter tests with `bazel test` *after* they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI.
* A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the `--sharding_strategy` argument.

Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts.

After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here.

Signed-off-by: Antoni Baum <[email protected]>
  • Loading branch information
Yard1 authored Oct 14, 2022
1 parent 4f07b61 commit d1aa560
Show file tree
Hide file tree
Showing 11 changed files with 787 additions and 121 deletions.
3 changes: 2 additions & 1 deletion .buildkite/pipeline.build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -284,9 +284,10 @@
- ./ci/env/env_info.sh
- >-
set -x;
python ./ci/run/bazel-sharding.py
python ./ci/run/bazel_sharding/bazel_sharding.py
--exclude_manual
--index "\${BUILDKITE_PARALLEL_JOB}" --count "\${BUILDKITE_PARALLEL_JOB_COUNT}"
--tag_filters=-post_wheel_build,-py37,-gpu
python/ray/serve/...
> test_shard.txt
- cat test_shard.txt
Expand Down
9 changes: 9 additions & 0 deletions .buildkite/pipeline.test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,12 @@
- bazel test --config=ci $(./scripts/bazel_export_options)
--test_tag_filters=ray_dag_tests
python/ray/dag/...

- label: ":toolbox: CI Tools"
conditions: ["RAY_CI_TOOLS_AFFECTED"]
instance_size: small
commands:
- cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/build/upload_build_info.sh; fi }; trap cleanup EXIT
- ./ci/env/install-dependencies.sh
- ./ci/env/env_info.sh
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only ci/run/bazel_sharding/...
2 changes: 1 addition & 1 deletion ci/ci.sh
Original file line number Diff line number Diff line change
Expand Up @@ -208,7 +208,7 @@ test_python() {
# Shard the args.
BUILDKITE_PARALLEL_JOB=${BUILDKITE_PARALLEL_JOB:-'0'}
BUILDKITE_PARALLEL_JOB_COUNT=${BUILDKITE_PARALLEL_JOB_COUNT:-'1'}
test_shard_selection=$(python ./ci/run/bazel-sharding.py --exclude_manual --index "${BUILDKITE_PARALLEL_JOB}" --count "${BUILDKITE_PARALLEL_JOB_COUNT}" "${args[@]}")
test_shard_selection=$(python ./ci/run/bazel_sharding/bazel_sharding.py --exclude_manual --index "${BUILDKITE_PARALLEL_JOB}" --count "${BUILDKITE_PARALLEL_JOB_COUNT}" "${args[@]}")

# TODO(mehrdadn): We set PYTHONPATH here to let Python find our pickle5 under pip install -e.
# It's unclear to me if this should be necessary, but this is to make tests run for now.
Expand Down
16 changes: 12 additions & 4 deletions ci/pipeline/determine_tests_to_run.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
import subprocess
import sys
from pprint import pformat
import traceback


# NOTE(simon): do not add type hint here because it's ran using python2 in CI.
Expand Down Expand Up @@ -98,6 +99,7 @@ def get_commit_range():
RAY_CI_DOCKER_AFFECTED = 0
RAY_CI_DOC_AFFECTED = 0
RAY_CI_PYTHON_DEPENDENCIES_AFFECTED = 0
RAY_CI_TOOLS_AFFECTED = 0

if is_pull_request():
commit_range = get_commit_range()
Expand All @@ -122,9 +124,9 @@ def get_commit_range():
print("RLlib tests impacted: ", len(impacted), file=sys.stderr)
for test in impacted.keys():
print(" ", test, file=sys.stderr)
except Exception as e:
except Exception:
print("Failed to dry run py_dep_analysis.py", file=sys.stderr)
print(e, file=sys.stderr)
traceback.print_exc(file=sys.stderr)
# End of dry run.

skip_prefix_list = [
Expand Down Expand Up @@ -217,13 +219,16 @@ def get_commit_range():
pass
elif changed_file.startswith("ci/lint"):
# Linter will always be run
pass
RAY_CI_TOOLS_AFFECTED = 1
elif changed_file.startswith("ci/pipeline"):
# These scripts are always run as part of the build process
pass
RAY_CI_TOOLS_AFFECTED = 1
elif changed_file.endswith("build-docker-images.py"):
RAY_CI_DOCKER_AFFECTED = 1
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_TOOLS_AFFECTED = 1
elif changed_file.startswith("ci/run"):
RAY_CI_TOOLS_AFFECTED = 1
elif changed_file.startswith("src/"):
RAY_CI_ML_AFFECTED = 1
RAY_CI_TUNE_AFFECTED = 1
Expand Down Expand Up @@ -259,6 +264,7 @@ def get_commit_range():
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
RAY_CI_DASHBOARD_AFFECTED = 1
RAY_CI_TOOLS_AFFECTED = 1
else:
RAY_CI_ML_AFFECTED = 1
RAY_CI_TUNE_AFFECTED = 1
Expand All @@ -274,6 +280,7 @@ def get_commit_range():
RAY_CI_LINUX_WHEELS_AFFECTED = 1
RAY_CI_MACOS_WHEELS_AFFECTED = 1
RAY_CI_DASHBOARD_AFFECTED = 1
RAY_CI_TOOLS_AFFECTED = 1

# Log the modified environment variables visible in console.
output_string = " ".join(
Expand All @@ -297,6 +304,7 @@ def get_commit_range():
"RAY_CI_PYTHON_DEPENDENCIES_AFFECTED={}".format(
RAY_CI_PYTHON_DEPENDENCIES_AFFECTED
),
"RAY_CI_TOOLS_AFFECTED={}".format(RAY_CI_TOOLS_AFFECTED),
]
)

Expand Down
114 changes: 0 additions & 114 deletions ci/run/bazel-sharding.py

This file was deleted.

15 changes: 15 additions & 0 deletions ci/run/bazel_sharding/BUILD
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
py_test(
name = "test_bazel_sharding",
size = "small",
srcs = ["tests/test_bazel_sharding.py"],
data = ["tests/mock_BUILD", "bazel_sharding.py"],
tags = ["team:ml"],
deps = [":bazel_sharding"]
)

# This is a dummy test dependency that causes the above tests to be
# re-run if any of these files changes.
py_library(
name = "bazel_sharding",
srcs = glob(["**/*.py"], exclude=["tests/*"]),
)
Loading

0 comments on commit d1aa560

Please sign in to comment.