Fix a grep-itself bug when checking for GPU healthcheck (pytorch#97929)
The logic works (see https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages)
Pull Request resolved: pytorch#97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
huydhn authored and pytorchmergebot committed Mar 30, 2023
1 parent b093dfa commit f92cae4
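For context on the grep-itself failure mode described in the commit message: with `set -x`, bash echoes each expanded command, error strings included, to the step's output, and the Actions runner streams that output into its `_diag/pages` log files, so a recursive grep over that directory can end up matching its own trace. Below is a minimal sketch of the effect, using a temporary directory as a stand-in for the runner's diag pages (the paths and the way the trace is captured are illustrative, not how the runner is actually wired up):

#!/usr/bin/env bash
# Illustrative only: a temp dir stands in for ${RUNNER_WORKSPACE}/../../_diag/pages
set -x
PAGES_DIR="$(mktemp -d)"
ERROR="No CUDA GPUs are available"

# With xtrace on, the expanded command (error string included) goes to stderr;
# capture it into a "page" file the way the runner captures step output.
{ : scanning for "${ERROR}"; } 2> "${PAGES_DIR}/page_1.log"

# The scan now matches the trace of the scan itself, not a real GPU failure.
if grep -Rli "${ERROR}" "${PAGES_DIR}"; then
  echo "False positive: grep matched its own xtrace output"
fi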
Showing 1 changed file with 14 additions and 40 deletions.
54 changes: 14 additions & 40 deletions .github/workflows/_linux-test.yml
@@ -272,8 +272,21 @@ jobs:
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does
# not seem to help. Here are some symptoms:
# * Calling nvidia-smi times out after 60 seconds
# * nvidia-smi fails with an "Unable to determine the device handle for GPU:
#   unknown error" message
# * Tests fail with a missing CUDA GPU error when initializing CUDA in PyTorch
# * Running docker --gpus all fails with an error response from the daemon
#
# As both the root cause and recovery path are unclear, let's take the runner out of
# service so that it doesn't get any more jobs
- name: Check NVIDIA driver installation step
if: failure() && steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure'
if:
failure() &&
((steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure') || (contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')))
shell: bash
env:
RUNNER_WORKSPACE: ${{ runner.workspace }}
@@ -289,42 +302,3 @@ jobs:
echo "NVIDIA driver installation has failed, shutting down the runner..."
.github/scripts/stop_runner_service.sh
fi
- name: Check GPU health (run this last)
if: failure() && contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
shell: bash
env:
RUNNER_WORKSPACE: ${{ runner.workspace }}
run: |
set +e
set -x
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does
# not seem to help. Here are some symptoms:
# * Calling nvidia-smi times out after 60 seconds
# * nvidia-smi fails with an "Unable to determine the device handle for GPU:
#   unknown error" message
# * Tests fail with a "No CUDA GPUs are available" error when initializing
#   CUDA in PyTorch
# * Running docker --gpus all fails with an error response from the daemon, while
#   nvidia-container-cli fails with "detection error: nvml error: unknown error"
#
# As both the root cause and recovery path are unclear, let's take the runner out of
# service so that it doesn't get any more jobs
UNRECOVERABLE_ERRORS=(
"No CUDA GPUs are available"
"docker: Error response from daemon"
)
for ERROR in "${UNRECOVERABLE_ERRORS[@]}"
do
grep -Rli "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"
RC=$?
# If GPU crashes, stop the runner to prevent it from receiving new jobs
if [[ "${RC}" == "0" ]]; then
echo "The runner has encoutered an unrecoverable error (${ERROR}), shutting it down..."
.github/scripts/stop_runner_service.sh
fi
done

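As the diff shows, this commit removes the grep loop above and instead folds the GPU-health condition into the `if:` of the preceding driver-check step. If a grep-based scan of the diag pages were kept, one way to avoid the self-match would be to keep xtrace off around the search so the error strings never show up in the step's own trace. A hypothetical variant along those lines (not taken from this commit):

# Hypothetical variant: keep xtrace off so the error strings never enter this step's log
set +x
UNRECOVERABLE_ERRORS=(
  "No CUDA GPUs are available"
  "docker: Error response from daemon"
)
for ERROR in "${UNRECOVERABLE_ERRORS[@]}"; do
  # -q: exit status only, so the matched text is not echoed back into the log either
  if grep -Rqi -- "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"; then
    echo "The runner has encountered an unrecoverable error (${ERROR}), shutting it down..."
    .github/scripts/stop_runner_service.sh
  fi
done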