Commit ac5b954: add expectations for flaky test issues
lavalamp committed Jan 25, 2016 (1 parent: 10f7985)
docs/devel/flaky-tests.md: 87 additions and 1 deletion

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Flaky tests

Any test that fails occasionally is "flaky". Since our merges only proceed when
all tests are green, and we have a number of different CI systems running the
tests in various combinations, even a small percentage of flakes results in a
lot of pain for people waiting for their PRs to merge.

Therefore, it's very important that we write tests defensively. Situations that
"almost never happen" happen with some regularity when run thousands of times in
resource-constrained environments. Since flakes can often be quite hard to
reproduce while still being common enough to block merges occasionally, it's
additionally important that the test logs be useful for narrowing down exactly
what caused the failure.
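
For instance, a check performed once after a fixed sleep "almost never" fails
on a developer workstation but flakes regularly on a loaded CI machine, while
polling with a generous deadline stays reliable. A minimal Go sketch (the
`fakeServer` and `pollUntil` names are hypothetical, purely for illustration;
real Kubernetes code would use an existing utility such as `wait.Poll`):

```
package example

import (
	"fmt"
	"testing"
	"time"
)

// fakeServer is a hypothetical stand-in for whatever component the test
// exercises; here it becomes ready ~50ms after construction.
type fakeServer struct{ started time.Time }

func (s *fakeServer) Ready() bool { return time.Since(s.started) > 50*time.Millisecond }

// pollUntil retries condition until it returns true or the deadline expires.
func pollUntil(timeout, interval time.Duration, condition func() bool) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if condition() {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("condition not met within %v", timeout)
}

func TestServerBecomesReady(t *testing.T) {
	srv := &fakeServer{started: time.Now()}

	// Flaky version: time.Sleep(100 * time.Millisecond) followed by a single
	// Ready() check assumes the server is always up within 100ms.

	// Defensive version: poll with a deadline far beyond the expected time.
	if err := pollUntil(30*time.Second, 10*time.Millisecond, srv.Ready); err != nil {
		t.Fatalf("server never became ready: %v", err)
	}
}
```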

Note that flakes can occur in unit tests, integration tests, or end-to-end
tests, but probably occur most commonly in end-to-end tests.

## Filing issues for flaky tests

Because flakes may be rare, it's very important that all relevant logs be
discoverable from the issue.

1. Search for the test name. If you find an open issue and you're 90% sure the
flake is exactly the same, add a comment instead of making a new issue.
2. If you make a new issue, you should title it with the test name, prefixed by
"e2e/unit/integration flake:" (whichever is appropriate).
3. Reference any old issues you found in step one.
4. Paste, in block quotes, the entire log of the individual failing test, not
just the failure line.
5. Link to durable storage with the rest of the logs. This means (for all the
tests that Google runs) the GCS link is mandatory! The Jenkins test result
link is nice but strictly optional: not only does it expire more quickly,
it's also not accessible to non-Googlers.

## Expectations when a flaky test is assigned to you

Note that we won't randomly assign these issues to you unless you've opted in or
you're part of a group that has opted in. We are more than happy to accept help
from anyone in fixing these, but due to the severity of the problem when merges
are blocked, we need reasonably quick turn-around time on test flakes. Therefore
we have the following guidelines:

1. If a flaky test is assigned to you, it's more important than anything else
you're doing unless you can get a special dispensation (in which case it will
be reassigned). If you have too many flaky tests assigned to you, or you
have such a dispensation, then it's *still* your responsibility to find new
owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
2. You should make a reasonable effort to reproduce it. Somewhere between an
hour and half a day of concentrated effort is "reasonable". It is perfectly
reasonable to ask for help!
3. If you can reproduce it (or it's obvious from the logs what happened), you
should then be able to fix it, or in the case where someone is clearly more
qualified to fix it, reassign it with very clear instructions.
4. If you can't reproduce it: __don't just close it!__ Every time a flake comes
back, at least 2 hours of merge time is wasted. So we need to make monotonic
progress towards narrowing it down every time a flake occurs. If you can't
figure it out from the logs, add log messages that would have helped you figure
it out, as in the sketch below.
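
For item 4, here is the kind of failure message that lets a flake be diagnosed
from CI logs alone. This is a hedged sketch with made-up `pod`/`node` types,
not real Kubernetes API objects; the point is the contrast between a bare
failure and one that captures the surrounding state:

```
package example

import "testing"

// pod and node are made-up stand-ins for real API objects, just to keep
// the snippet self-contained.
type pod struct {
	Name, Phase, Node string
	Events            []string
}

type node struct{ Ready bool }

func countReady(nodes []node) int {
	n := 0
	for _, nd := range nodes {
		if nd.Ready {
			n++
		}
	}
	return n
}

// This test is constructed to fail so it can demonstrate the message.
func TestPodScheduled(t *testing.T) {
	p := pod{Name: "test-pod", Phase: "Pending"} // never scheduled
	nodes := []node{{Ready: true}, {Ready: false}}

	if p.Node == "" {
		// Unhelpful: t.Fatal("pod not scheduled") -- a future reader of the
		// CI logs learns nothing. Instead, record the surrounding state:
		t.Fatalf("pod %q not scheduled; phase=%q, %d/%d nodes ready, events: %v",
			p.Name, p.Phase, countReady(nodes), len(nodes), p.Events)
	}
}
```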

# Reproducing unit test flakes

Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).

First, install it:

```
$ go install golang.org/x/tools/cmd/stress
```

Then build your test binary:

```
$ godep go test -c -race
```

Then run it under `stress`:

```
$ stress ./package.test -test.run=FlakyTest
```

`stress` runs the test binary over and over, periodically reporting run counts,
and writes the output of each failing run to a `/tmp/gostress-*` file. Be
careful with tests that use the `net/http/httptest` package; they could exhaust
the available ports on your system!
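
One common way `httptest`-based tests hit that limit (an illustrative sketch,
not from the original document) is a missing `Close`: each of the thousands of
runs `stress` performs then leaks a listening socket.

```
package example

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestHandler(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	// Without this Close, every run leaks a listening socket, eventually
	// exhausting the free ports on the machine running stress.
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		t.Fatalf("GET %s: %v", srv.URL, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("got status %d, want %d", resp.StatusCode, http.StatusOK)
	}
}
```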

# Hunting flaky unit tests in Kubernetes

Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.
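
A minimal illustration of such a race (hypothetical, not a real Kubernetes
test): the flaky version assumes a goroutine has finished by the time the
assertion runs, while the fixed version synchronizes explicitly.

```
package example

import (
	"sync"
	"testing"
	"time"
)

// Flaky: assumes the goroutine finishes within the sleep. It passes almost
// every run, but loses the race now and then on a loaded machine, and
// `go test -race` flags the unsynchronized access to counter.
func TestCounterFlaky(t *testing.T) {
	counter := 0
	go func() { counter++ }()
	time.Sleep(time.Millisecond) // hoping the goroutine has run -- flaky!
	if counter != 1 {
		t.Fatalf("counter = %d, want 1", counter)
	}
}

// Fixed: synchronizes explicitly, so the result is deterministic.
func TestCounterFixed(t *testing.T) {
	counter := 0
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); counter++ }()
	wg.Wait()
	if counter != 1 {
		t.Fatalf("counter = %d, want 1", counter)
	}
}
```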
