Update docs for steps to take if CI fails (dotnet#32548)

* Update docs for steps to take if CI fails
* update
* more
* more
* more
* include dumps
* more
* more
* typo
aLuxry · Feb 19, 2020 · 9d9a55a
1 parent e4d7893
commit 9d9a55a
Showing 1 changed file with 53 additions and 12 deletions.
diff --git a/docs/pr-guide.md b/docs/pr-guide.md
@@ -25,21 +25,62 @@ Anyone with write access can merge a pull request manually or by setting the [au
 * The PR has been approved by at least one reviewer and any other objections are addressed.
     * You can request another review from the original reviewer.
 * The PR successfully builds and passes all tests in the Continuous Integration (CI) system.
-    * You can trigger a rebuild by adding a comment like `/azp run <pipeline name>` or manually re-run only the failing lanes in Azure DevOps menu or on GitHub Checks tab clicking on "re-run failed checks" or "re-run all checks" if you want to re-run all.
-    * You can list the available pipelines by adding a comment like `/azp list` or get the available commands by adding a comment like `azp help`.
-    * Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.
+    * Depending on your change, you may need to re-run validation. See [rerunning validation](#rerunning-validation) below.
 
 Please always **squash** the pull request unless there are special circumstances. Do so, even if the PR contains only one commit. It creates a simpler history than a Merge Commit. "Special circumstances" are rare, and typically mean that there are a series of cleanly separated changes that will be too hard to understand if squashed together, or for some reason we want to preserve the ability to bisect them.
 
-## Unrelated failure
-
-In case CI indicates failures which are **highly unlikely** to be caused by changes in the PR, the following actions should be taken:
-
-* An existing issue in the repository should be searched for. Usually the test method's or the test assembly's name (in case of a crash) are good parameters.
-* If there's an existing issue, a comment should be placed that includes a) the link to the build, b) the affected configuration (ie `netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open`) and c) the Error message and Stack trace. This is necessary as retention policies are in place that recycle _old_ builds. In case the issue is already closed, it should be reopened and labels should be updated to reflect the current failure state. 
-* If there's no existing issue, an issue should be created with the same information outlined above.
-* In a follow-up Pull Request, the failing test(s) should be disabled with the corresponding issue link, e.g. `[ActiveIssue(x)]`, and the tracking issue should be labeled as `disabled-test`.
-* A comment should be placed in the original Pull Request that links to the created or updated issues.
+## Rerunning Validation
+
+Validation may fail for several reasons:
+
+### Option 1: You have a defect in your PR
+
+* Simply push the fix to your PR branch, and validation will start over.
+
+### Option 2: There is a flaky test that is not related to your PR
+
+* Your assumption should be that a failed test indicates a problem in your PR. (If we don't operate this way, chaos ensues.) If the test fails when run again, it is almost surely a failure caused by your PR. However, there are occasions where unrelated failures occur. Here are some ways to know:
+  * Perhaps you see the same failure in CI results for unrelated active PRs.
+  * It's a known issue listed in our [big tracking issue](https://github.com/dotnet/runtime/issues/702) or tagged `blocking-clean-ci` [(query here)](https://github.com/dotnet/runtime/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Ablocking-clean-ci+).
+  * It's otherwise beyond any reasonable doubt that your code changes could not have caused this.
+  * If the tests pass on rerun, that may suggest it's not related.
+* In this situation, you want to re-run but not necessarily rebase on master.
+  * To rerun just the failed leg(s):
+    * Click on any leg. Navigate through the Azure DevOps UI, find the "..." button and choose "Retry failed legs"
+    * Or, on the GitHub Checks tab choose "re-run failed checks". This will not rebase your change.
+  * To rerun all validation:
+    * Add a comment `/azp run runtime`
+    * Or, click on "re-run all checks" in the GitHub Checks tab
+    * Or, simply close and reopen the PR.
+* If you have established that it is an unrelated failure, please ensure we have an active issue for it. See the [what to do if you determine the failure is unrelated](#what-to-do-if-you-determine-the-failure-is-unrelated) section below.
+* Whoever merges the PR should be satisfied that the failure is unrelated, is not introduced by the change, and that we are appropriately tracking it.
+
+### Option 3: The state of the master branch HEAD is bad
+
+* This is the very rare case where there was a build break in master, and you got unlucky. Hopefully the break has been fixed, and you want CI to rebase your change and rerun validation.
+* To rebase and rerun all validation:
+  * Add a comment `/azp run runtime`
+  * Or, click on "re-run all checks" in the GitHub Checks tab
+  * Or, simply close and reopen the PR.
+
+### Additional information:
+  * You can list the available pipelines by adding a comment like `/azp list` or get the available commands by adding a comment like `/azp help`.
+  * Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.
+
+## What to do if you determine the failure is unrelated
+
+If you have determined the failure is definitely not caused by changes in your PR, please do this:
+
+* Search for an [existing issue](https://github.com/dotnet/runtime/issues). Usually the test method name or (for a crash or hang) the test assembly name is a good search term.
+  * If there's an existing issue, add a comment with
+    * a) the link to the build
+    * b) the affected configuration (e.g. `netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open`)
+    * c) all console output including the error message and stack trace from the Azure DevOps tab (This is necessary as retention policies are in place that recycle old builds.)
+    * d) if there's a dump file (see Attachments tab in Azure DevOps) include that
+    * If the issue is already closed, reopen it and update the labels to reflect the current failure state.
+  * If there's no existing issue, create an issue with the same information listed above.
+  * Update the original pull request with a comment linking to the new or existing issue.
+* In a follow-up Pull Request, disable the failing test(s) with the corresponding issue link, e.g. `[ActiveIssue(x)]`, and update the tracking issue with the label `disabled-test`.
 
 There are plenty of possible bugs, e.g. race conditions, where a failure might highlight a real problem and it won't manifest again on a retry. Therefore these steps should be followed for every iteration of the PR build, e.g. before retrying/rebuilding.
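
The reporting steps a) through d) in the new guide text could be captured in a small helper that assembles the issue comment, so nothing is forgotten before a build is recycled. This is only a minimal sketch and not part of the commit: the `FailureReport` type, its field names, and `format_issue_comment` are hypothetical, and the comment layout is an assumption rather than a format the repository requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureReport:
    """Hypothetical container for the details the guide asks for."""
    build_url: str                  # a) link to the build
    configuration: str              # b) affected configuration
    console_output: str             # c) console output, error message, stack trace
    dump_url: Optional[str] = None  # d) link to a dump file, if one exists


def format_issue_comment(report: FailureReport) -> str:
    """Render the report as a Markdown comment (layout is an assumption)."""
    fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
    lines = [
        f"Build: {report.build_url}",
        f"Configuration: `{report.configuration}`",
    ]
    # The dump line is optional because not every failure produces a dump.
    if report.dump_url:
        lines.append(f"Dump: {report.dump_url}")
    lines += ["", "Console output:", "", fence, report.console_output.strip(), fence]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only; the URL is not a real build.
    example = FailureReport(
        build_url="https://example.invalid/build/1",
        configuration="netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open",
        console_output="System.Exception: boom\n   at Example.Test()",
    )
    print(format_issue_comment(example))
```

Pasting the console output into the comment, rather than only linking to the build, matters because the guide notes that retention policies recycle old builds.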