Update docs for steps to take if CI fails (dotnet#32548)
* Update docs for steps to take if CI fails

* update

* more

* more

* more

* include dumps

* more

* more

* typo
danmoseley authored Feb 19, 2020
1 parent e4d7893 commit 9d9a55a
Showing 1 changed file with 53 additions and 12 deletions.
65 changes: 53 additions & 12 deletions docs/pr-guide.md
@@ -25,21 +25,62 @@ Anyone with write access can merge a pull request manually or by setting the [au
* The PR has been approved by at least one reviewer and any other objections are addressed.
* You can request another review from the original reviewer.
* The PR successfully builds and passes all tests in the Continuous Integration (CI) system.
* You can trigger a rebuild by adding a comment like `/azp run <pipeline name>`. Alternatively, re-run only the failing lanes from the Azure DevOps menu, or on the GitHub Checks tab click "re-run failed checks" (or "re-run all checks" to re-run everything).
* You can list the available pipelines by adding a comment like `/azp list` or get the available commands by adding a comment like `/azp help`.
* Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.
* Depending on your change, you may need to re-run validation. See [rerunning validation](#rerunning-validation) below.

Please always **squash** the pull request unless there are special circumstances. Do so, even if the PR contains only one commit. It creates a simpler history than a Merge Commit. "Special circumstances" are rare, and typically mean that there are a series of cleanly separated changes that will be too hard to understand if squashed together, or for some reason we want to preserve the ability to bisect them.

## Unrelated failure

In case CI indicates failures which are **highly unlikely** to be caused by changes in the PR, the following actions should be taken:

* An existing issue in the repository should be searched for. Usually the test method's or the test assembly's name (in case of a crash) are good parameters.
* If there's an existing issue, a comment should be placed that includes a) the link to the build, b) the affected configuration (e.g. `netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open`) and c) the error message and stack trace. This is necessary as retention policies are in place that recycle _old_ builds. In case the issue is already closed, it should be reopened and labels should be updated to reflect the current failure state.
* If there's no existing issue, an issue should be created with the same information outlined above.
* In a follow-up Pull Request, the failing test(s) should be disabled with the corresponding issue link, e.g. `[ActiveIssue(x)]`, and the tracking issue should be labeled as `disabled-test`.
* A comment should be placed in the original Pull Request that links to the created or updated issues.
## Rerunning Validation

Validation may fail for several reasons:

### Option 1: You have a defect in your PR

* Simply push the fix to your PR branch, and validation will start over.

### Option 2: There is a flaky test that is not related to your PR

* Your assumption should be that a failed test indicates a problem in your PR. (If we don't operate this way, chaos ensues.) If the test fails when run again, it is almost surely a failure caused by your PR. However, there are occasions where unrelated failures occur. Here are some ways to know:
  * Perhaps you see the same failure in CI results for unrelated active PRs.
  * It's a known issue listed in our [big tracking issue](https://github.com/dotnet/runtime/issues/702) or tagged `blocking-clean-ci` [(query here)](https://github.com/dotnet/runtime/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Ablocking-clean-ci+).
  * It's otherwise beyond any reasonable doubt that your code changes could not have caused this.
  * If the tests pass on rerun, that may suggest it's not related.
* In this situation, you want to re-run but not necessarily rebase on master.
* To rerun just the failed leg(s):
  * Click on any leg. Navigate through the Azure DevOps UI, find the "..." button and choose "Retry failed legs".
  * Or, on the GitHub Checks tab choose "re-run failed checks". This will not rebase your change.
* To rerun all validation:
  * Add a comment `/azp run runtime`.
  * Or, click on "re-run all checks" in the GitHub Checks tab.
  * Or, simply close and reopen the PR.
* If you have established that it is an unrelated failure, please ensure we have an active issue for it. See the [unrelated failure](#what-to-do-if-you-determine-the-failure-is-unrelated) section below.
* Whoever merges the PR should be satisfied that the failure is unrelated, is not introduced by the change, and that we are appropriately tracking it.

### Option 3: The state of the master branch HEAD is bad

* This is the very rare case where there was a build break in master, and you got unlucky. Hopefully the break has been fixed, and you want CI to rebase your change and rerun validation.
* To rebase and rerun all validation:
  * Add a comment `/azp run runtime`.
  * Or, click on "re-run all checks" in the GitHub Checks tab.
  * Or, simply close and reopen the PR.
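
The three options above amount to a small decision table. As a rough illustration only, the sketch below restates it in Python; `FailureCause` and `rerun_actions` are hypothetical names that do not exist in any dotnet tooling, and the returned strings merely paraphrase the alternatives listed in this guide.

```python
from enum import Enum
from typing import List


class FailureCause(Enum):
    """Why validation failed, mirroring Options 1-3 above (hypothetical names)."""
    DEFECT_IN_PR = "option 1"
    UNRELATED_FLAKY_TEST = "option 2"
    BAD_MASTER_HEAD = "option 3"


def rerun_actions(cause: FailureCause) -> List[str]:
    """Return the alternative actions this guide lists for the given cause."""
    if cause is FailureCause.DEFECT_IN_PR:
        # Pushing a fix makes validation start over.
        return ["push the fix to the PR branch"]
    if cause is FailureCause.UNRELATED_FLAKY_TEST:
        # Rerun only the failed legs, without necessarily rebasing on master.
        return [
            'Azure DevOps: "..." button > "Retry failed legs"',
            'GitHub Checks tab: "re-run failed checks"',
        ]
    if cause is FailureCause.BAD_MASTER_HEAD:
        # Rebase and rerun all validation.
        return [
            "comment: /azp run runtime",
            'GitHub Checks tab: "re-run all checks"',
            "close and reopen the PR",
        ]
    raise ValueError(f"unknown cause: {cause!r}")


for action in rerun_actions(FailureCause.BAD_MASTER_HEAD):
    print(action)
```

Any one of the returned actions is enough; they are alternatives, not a sequence.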

### Additional information

* You can list the available pipelines by adding a comment like `/azp list` or get the available commands by adding a comment like `/azp help`.
* Reach out to the infrastructure team for assistance on [Teams channel](https://teams.microsoft.com/l/channel/19%3ab27b36ecd10a46398da76b02f0411de7%40thread.skype/Infrastructure?groupId=014ca51d-be57-47fa-9628-a15efcc3c376&tenantId=72f988bf-86f1-41af-91ab-2d7cd011db47) (for corpnet users) or on [Gitter](https://gitter.im/dotnet/community) in other cases.

## What to do if you determine the failure is unrelated

If you have determined the failure is definitely not caused by changes in your PR, please do this:

* Search for an [existing issue](https://github.com/dotnet/runtime/issues). Usually the test method name or (for a crash or hang) the test assembly name is a good search term.
* If there's an existing issue, add a comment with:
  * a) the link to the build
  * b) the affected configuration (e.g. `netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open`)
  * c) all console output, including the error message and stack trace, from the Azure DevOps tab (this is necessary because retention policies recycle old builds)
  * d) the dump file, if there is one (see the Attachments tab in Azure DevOps)
* If the issue is already closed, reopen it and update the labels to reflect the current failure state.
* If there's no existing issue, create an issue with the same information listed above.
* Update the original pull request with a comment linking to the new or existing issue.
* In a follow-up Pull Request, disable the failing test(s) with the corresponding issue link, e.g. `[ActiveIssue(x)]`, and update the tracking issue with the label `disabled-test`.

There are plenty of possible bugs, e.g. race conditions, where a failure highlights a real problem that won't manifest again on a retry. Therefore, follow these steps for every iteration of the PR build, i.e. before retrying or rebuilding.
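
To make the comment contents listed above concrete, here is a minimal sketch that assembles items a) through d) into one Markdown comment body. The function name `build_failure_comment`, its parameters, and the build URL and console output in the usage lines are all hypothetical placeholders; only the configuration string comes from this guide.

```python
from typing import Optional

# Markdown code fence, built from single backticks to avoid nesting literal fences.
FENCE = "`" * 3


def build_failure_comment(
    build_url: str,
    configuration: str,
    console_output: str,
    dump_url: Optional[str] = None,
) -> str:
    """Assemble items a) to d) of the checklist into one comment body."""
    parts = [
        f"a) Build: {build_url}",
        f"b) Configuration: `{configuration}`",
        "c) Console output (error message and stack trace):",
        FENCE,
        console_output.strip(),
        FENCE,
    ]
    # Item d) only applies when the build produced a dump file.
    if dump_url is not None:
        parts.append(f"d) Dump: {dump_url}")
    return "\n".join(parts)


# Placeholder values for illustration only.
comment = build_failure_comment(
    build_url="https://example.invalid/build/123",
    configuration="netcoreapp-Windows_NT-Release-x64-Windows.81.Amd64.Open",
    console_output="System.Exception: placeholder\n   at Placeholder.Test()",
)
print(comment)
```

Pasting the console output into the issue, rather than only linking to the build, keeps the details available after the retention policy recycles the build.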
