Fix a grep-itself bug when checking for GPU healthcheck (pytorch#97929)
The logic works (see https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages)
Pull Request resolved: pytorch#97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
huydhn authored and pytorchmergebot committed Mar 30, 2023
1 parent b093dfa commit f92cae4
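For context on the grep-itself failure mode described in the commit message: with `set -x`, bash echoes each expanded command, error strings included, to the step's output, and the Actions runner streams that output into its `_diag/pages` log files, so a recursive grep over that directory can end up matching its own trace. Below is a minimal sketch of the effect, using a temporary directory as a stand-in for the runner's diag pages (the paths and the way the trace is captured are illustrative, not how the runner is actually wired up):

#!/usr/bin/env bash
# Illustrative only: a temp dir stands in for ${RUNNER_WORKSPACE}/../../_diag/pages
set -x
PAGES_DIR="$(mktemp -d)"
ERROR="No CUDA GPUs are available"

# With xtrace on, the expanded command (error string included) goes to stderr;
# capture it into a "page" file the way the runner captures step output.
{ : scanning for "${ERROR}"; } 2> "${PAGES_DIR}/page_1.log"

# The scan now matches the trace of the scan itself, not a real GPU failure.
if grep -Rli "${ERROR}" "${PAGES_DIR}"; then
  echo "False positive: grep matched its own xtrace output"
fi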
Showing 1 changed file with 14 additions and 40 deletions.
54 changes: 14 additions & 40 deletions .github/workflows/_linux-test.yml
@@ -272,8 +272,21 @@ jobs:
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does
# not seem to help. Here are some symptoms:
# * Calling nvidia-smi times out after 60 seconds
# * nvidia-smi fails with an "Unable to determine the device handle for GPU:
#   unknown error" message
# * Tests fail with a missing CUDA GPU error when initializing CUDA in PyTorch
# * Running docker --gpus all fails with an error response from the daemon
#
# As both the root cause and recovery path are unclear, let's take the runner out of
# service so that it doesn't get any more jobs
- name: Check NVIDIA driver installation step
if: failure() && steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure'
if:
failure() &&
((steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure') || (contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')))
shell: bash
env:
RUNNER_WORKSPACE: ${{ runner.workspace }}
@@ -289,42 +302,3 @@ jobs:
echo "NVIDIA driver installation has failed, shutting down the runner..."
.github/scripts/stop_runner_service.sh
fi
- name: Check GPU health (run this last)
if: failure() && contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
shell: bash
env:
RUNNER_WORKSPACE: ${{ runner.workspace }}
run: |
set +e
set -x
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does
# not seem to help. Here are some symptoms:
# * Calling nvidia-smi times out after 60 seconds
# * nvidia-smi fails with an "Unable to determine the device handle for GPU:
#   unknown error" message
# * Tests fail with a "No CUDA GPUs are available" error when initializing
#   CUDA in PyTorch
# * Running docker --gpus all fails with an error response from the daemon, while
#   nvidia-container-cli fails with "detection error: nvml error: unknown error"
#
# As both the root cause and recovery path are unclear, let's take the runner out of
# service so that it doesn't get any more jobs
UNRECOVERABLE_ERRORS=(
"No CUDA GPUs are available"
"docker: Error response from daemon"
)
for ERROR in "${UNRECOVERABLE_ERRORS[@]}"
do
grep -Rli "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"
RC=$?
# If GPU crashes, stop the runner to prevent it from receiving new jobs
if [[ "${RC}" == "0" ]]; then
echo "The runner has encoutered an unrecoverable error (${ERROR}), shutting it down..."
.github/scripts/stop_runner_service.sh
fi
done

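As the diff shows, this commit removes the grep loop above and instead folds the GPU-health condition into the `if:` of the preceding driver-check step. If a grep-based scan of the diag pages were kept, one way to avoid the self-match would be to keep xtrace off around the search so the error strings never show up in the step's own trace. A hypothetical variant along those lines (not taken from this commit):

# Hypothetical variant: keep xtrace off so the error strings never enter this step's log
set +x
UNRECOVERABLE_ERRORS=(
  "No CUDA GPUs are available"
  "docker: Error response from daemon"
)
for ERROR in "${UNRECOVERABLE_ERRORS[@]}"; do
  # -q: exit status only, so the matched text is not echoed back into the log either
  if grep -Rqi -- "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"; then
    echo "The runner has encountered an unrecoverable error (${ERROR}), shutting it down..."
    .github/scripts/stop_runner_service.sh
  fi
done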