Skip to content

Commit

Permalink
[SPARK-11960][MLLIB][DOC] User guide for streaming tests
Browse files Browse the repository at this point in the history
CC jkbradley mengxr josepablocam

Author: Feynman Liang <[email protected]>

Closes #10005 from feynmanliang/streaming-test-user-guide.
  • Loading branch information
feynmanliang authored and mengxr committed Nov 30, 2015
1 parent de64b65 commit 5535888
Show file tree
Hide file tree
Showing 3 changed files with 28 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/mllib-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ We list major functionality from both below, with links to detailed guides.
* [correlations](mllib-statistics.html#correlations)
* [stratified sampling](mllib-statistics.html#stratified-sampling)
* [hypothesis testing](mllib-statistics.html#hypothesis-testing)
* [streaming significance testing](mllib-statistics.html#streaming-significance-testing)
* [random data generation](mllib-statistics.html#random-data-generation)
* [Classification and regression](mllib-classification-regression.html)
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
Expand Down
25 changes: 25 additions & 0 deletions docs/mllib-statistics.md
Original file line number Diff line number Diff line change
Expand Up @@ -521,6 +521,31 @@ print(testResult) # summary of the test including the p-value, test statistic,
</div>
</div>

### Streaming Significance Testing
MLlib provides online implementations of some tests to support use cases
like A/B testing. These tests may be performed on a Spark Streaming
`DStream[(Boolean,Double)]` where the first element of each tuple
indicates control group (`false`) or treatment group (`true`) and the
second element is the value of an observation.

Streaming significance testing supports the following parameters:

* `peacePeriod` - The number of initial data points from the stream to
ignore, used to mitigate novelty effects.
* `windowSize` - The number of past batches to perform hypothesis
testing over. Setting to `0` will perform cumulative processing using
all prior batches.


<div class="codetabs">
<div data-lang="scala" markdown="1">
[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
provides streaming hypothesis testing.

{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
</div>
</div>


## Random data generation

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ object StreamingTestExample {
dir.toString
})

// $example on$
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
case Array(label, value) => (label.toBoolean, value.toDouble)
})
Expand All @@ -75,6 +76,7 @@ object StreamingTestExample {

val out = streamingTest.registerStream(data)
out.print()
// $example off$

// Stop processing if test becomes significant or we time out
var timeoutCounter = numBatchesTimeout
Expand Down

0 comments on commit 5535888

Please sign in to comment.