Conversion of Drake rst documentation to markdown format.
* This is a minimal translation of the existing documentation,
  in preparation for adding Jekyll scaffolding for the new
  website.

* Transcribed for PR by ggould-tri

* Add empty index.md file required by jekyll template
BetsyMcPhail authored and ggould-tri committed Feb 11, 2021
1 parent 7d8a7c1 commit 777937d
Showing 38 changed files with 1,499 additions and 2,162 deletions.
285 changes: 138 additions & 147 deletions doc/bazel.md

Large diffs are not rendered by default.

120 changes: 49 additions & 71 deletions doc/buildcop.md
@@ -1,31 +1,23 @@
.. _build_cop:
---
title: Build Cop
---

*********
Build Cop
*********
#### Overview

.. _overview:

Overview
--------

The Drake build cop monitors `continuous <https://drake-
jenkins.csail.mit.edu/view/Continuous%20Production/>`_, `nightly
<https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/>`_, and
`weekly <https://drake-jenkins.csail.mit.edu/view/Weekly%20Production/>`_
The Drake build cop monitors [continuous](https://drake-jenkins.csail.mit.edu/view/Continuous%20Production/),
[nightly](https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/), and
[weekly](https://drake-jenkins.csail.mit.edu/view/Weekly%20Production/)
production continuous integration failures in the
`RobotLocomotion/drake <https://github.com/RobotLocomotion/drake>`_ GitHub
[RobotLocomotion/drake](https://github.com/RobotLocomotion/drake) GitHub
repo.

The build cop will rotate on a weekly basis. The
`schedule <https://github.com/RobotLocomotion/drake-ci/wiki/Build-Cop-Rotation>`_
[schedule](https://github.com/RobotLocomotion/drake-ci/wiki/Build-Cop-Rotation)
is maintained on the
`RobotLocomotion/drake-ci <https://github.com/RobotLocomotion/drake-ci>`_ wiki.
[RobotLocomotion/drake-ci](https://github.com/RobotLocomotion/drake-ci) wiki.

.. _process:
#### Process

Process
-------
The build cop is expected to be on duty during normal business hours Eastern
Time, approximately 9am to 5pm on weekdays, holidays excepted. Developers are
encouraged, but not required, to merge pull requests during times when the build
@@ -50,25 +42,20 @@ fix the failure within 60 minutes, the build cop will merge the pull request to
revert the commits and verify that the continuous builds triggered by that merge
pass.

Use the `DrakeDevelopers Slack channel
#buildcop <https://drakedevelopers.slack.com/messages/buildcop/details/>`_
Use the [DrakeDevelopers Slack channel #buildcop](https://drakedevelopers.slack.com/messages/buildcop/details/)
to discuss build issues with your partner build cop and other Drake
contributors.

At the end of each rotation, the build cop should complete the
`build cop review and retrospective
<https://docs.google.com/document/d/120AOAaamIMO-SM1UaJ6vfzpA15LnXHexDF4a7MLAS3o/edit#heading=h.sxk1djc2v0yg>`_,
and should notify the next build cop on the `DrakeDevelopers Slack channel
#buildcop <https://drakedevelopers.slack.com/messages/buildcop/details/>`_.
[build cop review and retrospective](https://docs.google.com/document/d/120AOAaamIMO-SM1UaJ6vfzpA15LnXHexDF4a7MLAS3o/edit#heading=h.sxk1djc2v0yg),
and should notify the next build cop on the [DrakeDevelopers Slack channel #buildcop](https://drakedevelopers.slack.com/messages/buildcop/details/).

.. _revert_template:
#### Revert Template

Revert Template
---------------
When creating a revert PR, the build cop will assign that PR to the original
author, and include the following template in the PR description.

::
```
Dear $AUTHOR,
@@ -104,45 +91,37 @@ author, and include the following template in the PR description.
[1] CI Production Dashboard: https://drake-jenkins.csail.mit.edu/view/Production/
[2] https://drake.mit.edu/buildcop.html#workflow-for-handling-a-build-cop-revert
```

.. _handling_a_build_cop_revert:

Workflow for Handling a Build Cop Revert
----------------------------------------
#### Workflow for Handling a Build Cop Revert

Suppose your merged PR was reverted on the master branch. What do you do?

Here's one workflow:

1. Create a new development branch based off of the ``HEAD`` of master.

2. `Revert <https://git-scm.com/docs/git-revert>`_ the revert of your
2. [Revert](https://git-scm.com/docs/git-revert) the revert of your
originally-merged PR to get your changes back.

3. Debug the problem. This may require you to
:ref:`run on-demand continuous integration builds <run_specific_build>` to
[run on-demand continuous integration](/jenkins.html#scheduling-an-on-demand-build) to
ensure the problem that caused your PR to be reverted was actually fixed.

4. Commit your changes into your new branch.

5. Issue a new PR containing your fixes. Be sure to link to the build cop revert
PR in your new PR.
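
Steps 1 and 2 of the workflow above can be sketched as a short Python helper that only assembles the git commands and does not run them. This is a minimal sketch under stated assumptions: the function name `revert_recovery_commands`, the branch name, and the `origin` remote are hypothetical, and it assumes the build cop's revert landed as an ordinary (non-merge) commit on master.

```python
import shlex


def revert_recovery_commands(branch, revert_sha, remote="origin"):
    """Return the git commands for steps 1 and 2 of the workflow, in order.

    `branch` is the new development branch (hypothetical name) and
    `revert_sha` is the SHA of the build cop's revert commit on master.
    """
    return [
        # Step 1: create a branch based on the current HEAD of master.
        ["git", "fetch", remote, "master"],
        ["git", "checkout", "-b", branch, f"{remote}/master"],
        # Step 2: revert the revert commit to get your changes back.
        ["git", "revert", "--no-edit", revert_sha],
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own branch name and revert SHA.
    for cmd in revert_recovery_commands("my-fix-branch", "REVERT_SHA"):
        print(shlex.join(cmd))
```

The helper returns argument lists rather than executing anything, so the commands can be reviewed before they are run by hand or passed to `subprocess.run`. Steps 3 through 5 (debugging, committing, and opening the new PR) remain manual.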


.. _build_cop_playbook:
#### Build Cop Playbook

Build Cop Playbook
------------------
This section is a quick-reference manual for the on-call build cop.

Monitor the Build
^^^^^^^^^^^^^^^^^
Check the `Continuous Production <https://drake-jenkins.csail.mit.edu/view/Continuous%20Production/>`_
##### Monitor the Build

Check the [Continuous Production](https://drake-jenkins.csail.mit.edu/view/Continuous%20Production/)
build dashboard in Jenkins at least once an hour during on-call hours. These
builds run after every merge to Drake. Also check the
`Nightly Production <https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/>`_
[Nightly Production](https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/)
build dashboard every morning and
`Weekly Production <https://drake-jenkins.csail.mit.edu/view/Weekly%20Production/>`_
[Weekly Production](https://drake-jenkins.csail.mit.edu/view/Weekly%20Production/)
build dashboard on Monday morning. These builds are unusually
resource-intensive, and therefore run at most once per day.
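
The checking cadence described above can be sketched as a small lookup. This is only an illustration: the function name `dashboards_to_check` and the `first_check_today` flag are assumptions rather than part of any Drake tooling, and "every morning" is modeled as the first check of the day.

```python
from datetime import datetime

# Dashboard URLs as given in this document.
DASHBOARDS = {
    "continuous": "https://drake-jenkins.csail.mit.edu/view/Continuous%20Production/",
    "nightly": "https://drake-jenkins.csail.mit.edu/view/Nightly%20Production/",
    "weekly": "https://drake-jenkins.csail.mit.edu/view/Weekly%20Production/",
}


def dashboards_to_check(now, first_check_today):
    """Return the names of the dashboards due for a look at `now`."""
    due = ["continuous"]  # at least once an hour during on-call hours
    if first_check_today:
        due.append("nightly")  # every morning
        if now.weekday() == 0:  # Monday morning
            due.append("weekly")
    return due


if __name__ == "__main__":
    for name in dashboards_to_check(datetime.now(), first_check_today=True):
        print(name, DASHBOARDS[name])
```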

@@ -154,8 +133,8 @@ color of the previous build.

Note that CDash pages may take a minute to populate.

Respond to Breakage
^^^^^^^^^^^^^^^^^^^
##### Respond to Breakage

There are various reasons the build might break. Diagnose the failure, and
then take appropriate action. This section lists some common failures and
recommended responses. However, build cops often have to address unexpected
@@ -171,12 +150,11 @@ Determine if an open GitHub Drake issue describes the situation. For example,
some tests are flaky for reasons that have no known resolution, but are
described by Drake issues. If you find that your broken build is described by
such an issue, consider adding the build information to the issue for future
analysis. The `build cop review and retrospective
<https://docs.google.com/document/d/120AOAaamIMO-SM1UaJ6vfzpA15LnXHexDF4a7MLAS3o/edit#heading=h.sxk1djc2v0yg>`_
analysis. The [build cop review and retrospective](https://docs.google.com/document/d/120AOAaamIMO-SM1UaJ6vfzpA15LnXHexDF4a7MLAS3o/edit#heading=h.sxk1djc2v0yg)
also describes current build issues.

Broken Compile or Test
**********************
##### Broken Compile or Test

Sometimes people merge code that doesn't compile, or that fails a test.
This can happen for several reasons:

@@ -191,15 +169,15 @@ Consult the list of commits in the breaking change to identify possible culprit
PRs. Try to rule out some of those PRs by comparing their contents to the
specifics of the failure. For any PRs you cannot rule out, create a rollback
by clicking "Revert" in the GitHub UI. Use the
:ref:`template message <revert_template>` to communicate with the author, and
[template message](/buildcop.html#revert-template) to communicate with the author, and
proceed as specified in that message.

:ref:`Manually schedule <run_specific_build>` the failing build as an
[Manually schedule](/jenkins.html#run-specific-build) the failing build as an
experimental build on the rollback PR. If it passes, the odds are good that you
have found the culprit. Proceed as specified in the template message.

Flaky Test
**********
##### Flaky Test

Sometimes people introduce code that makes a test non-deterministic, failing
on some runs and passing on others. You cannot reliably attribute a flaky test
failure to the first failing build, because it may have passed by chance for
@@ -209,11 +187,11 @@ Test failures will be yellow in Jenkins. If the list of commits in the breaking
change does not include any plausible culprits, you may be looking at a flaky
test. Look through earlier commits one-by-one for plausible culprits.
After you identify one, create a rollback by clicking "Revert" in the
GitHub UI. Use the :ref:`template message <revert_template>` to communicate
GitHub UI. Use the [template message](/buildcop.html#revert-template) to communicate
with the author, and proceed as specified in that message.

Restarting Mac Nightly Builds
******************************
##### Restarting Mac Nightly Builds

Occasionally there will be flaky tests or timeouts in the Mac nightly builds.
While it is tempting to restart these builds to clear the errors, Mac resources
are limited and restarting the long-running nightly builds may tie up resources
@@ -228,20 +206,20 @@ their best judgement, keeping in mind the following guidelines:
* If the timed-out test failed last build (not just timed out), you may consider re-running.


Broken CI Script
****************
##### Broken CI Script

Sometimes people merge changes to the Drake CI scripts that result in spurious
CI failures. The list of commits in Jenkins for each continuous build includes
the `drake-ci <https://github.com/RobotLocomotion/drake-ci>`_ repository as well
the [drake-ci](https://github.com/RobotLocomotion/drake-ci) repository as well
as Drake proper. Consider whether those changes are possible culprits.

If you believe a CI script change is the culprit, contact the author.
If they are not responsive, revert the commit yourself and see what happens on
the next continuous build. There are no pre-merge builds you can run that
exercise changes to the CI scripts themselves.

Infrastructure Flake
********************
##### Infrastructure Flake

The machinery of the CI system itself sometimes fails for reasons unrelated to
any code change. The most common infrastructure flakes include:

@@ -263,8 +241,8 @@ can be safely ignored.
If you see "All nodes of label <label> are offline", this should disappear
eventually and the build should run, once Jenkins gets a node booted up.

Infrastructure Collapse
***********************
##### Infrastructure Collapse

Occasionally, some piece of CI infrastructure completely stops working. For
instance, GitHub, AWS, or MacStadium could have an outage, or our Jenkins server
could crash or become wedged. During infrastructure collapses, lots of builds
@@ -275,9 +253,9 @@ alert Kitware by assigning a GitHub issue to both @BetsyMcPhail and
@jamiesnape. If it's under a vendor's control, spread the news and simply wait
it out.

Drake External Examples
***********************
Details of failures in the `drake-external-examples <https://github.com/RobotLocomotion/drake-external-examples/>`_
##### Drake External Examples

Details of failures in the [drake-external-examples](https://github.com/RobotLocomotion/drake-external-examples/)
repository, which may be denoted by red "build failing" icons at the top of the build
dashboard on Jenkins, should be posted to the `#buildcop <https://drakedevelopers.slack.com/messages/buildcop/details/>`_
dashboard on Jenkins, should be posted to the [#buildcop](https://drakedevelopers.slack.com/messages/buildcop/details/)
channel on Slack, ensuring that @jamiesnape is mentioned in the message.