[SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes
## What changes were proposed in this pull request?

Added short section for KSTest.
Also added logreg model to list of ML models in vignette.  (This will be reorganized under SPARK-18849)

![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)

## How was this patch tested?

Manually tested example locally.
Built vignettes locally.

Author: Joseph K. Bradley

Closes apache#16283 from jkbradley/ksTest-vignette.
jkbradley committed Dec 14, 2016
1 parent 1ac6567 commit 7862742
Showing 1 changed file with 28 additions and 1 deletion.
29 changes: 28 additions & 1 deletion R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -469,6 +469,10 @@ SparkR supports the following machine learning models and algorithms.

* Isotonic Regression Model

* Logistic Regression Model

* Kolmogorov-Smirnov Test

More will be added in the future.

### R Formula
@@ -800,7 +804,7 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2)))
head(predict(isoregModel, newDF))
```

-### Logistic Regression Model
+#### Logistic Regression Model

(Added in 2.1.0)

@@ -834,6 +838,29 @@ model <- spark.logit(df, Species ~ ., regParam = 0.5)
summary(model)
```

#### Kolmogorov-Smirnov Test

`spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
Given a `SparkDataFrame`, the test compares continuous data in a given column `testCol` with the theoretical distribution
specified by parameter `nullHypothesis`.
Users can call `summary` to get a summary of the test results.

In the following example, we test whether the `longley` dataset's `Armed_Forces` column
follows a normal distribution. We set the parameters of the normal distribution using
the mean and standard deviation of the sample.

```{r, warning=FALSE}
df <- createDataFrame(longley)
afStats <- head(select(df, mean(df$Armed_Forces), sd(df$Armed_Forces)))
afMean <- afStats[1]
afStd <- afStats[2]
test <- spark.kstest(df, "Armed_Forces", "norm", c(afMean, afStd))
testSummary <- summary(test)
testSummary
```
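
As a sanity check, the same hypothesis can be tested locally in base R without Spark. The sketch below is not part of this patch; it assumes the built-in `longley` data frame, where the column is named `Armed.Forces` (SparkR replaces the `.` with `_` when creating the `SparkDataFrame`), and it uses `stats::ks.test` with the sample mean and standard deviation as the parameters of the normal distribution. The statistic and p-value should be close to the `spark.kstest` result, although exact agreement is not guaranteed.

```r
# Base R cross-check of the same hypothesis; this does not use Spark.
# In the built-in data frame the column is named Armed.Forces.
x <- longley$Armed.Forces

# Two-sided, one-sample KS test against N(mean(x), sd(x)).
ksLocal <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

ksLocal$statistic  # KS statistic D
ksLocal$p.value    # p-value of the two-sided test
```

`ks.test` defaults to the two-sided alternative, which matches the two-sided, one-sample test that `spark.kstest` runs.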


### Model Persistence
The following example shows how to save/load an ML model by SparkR.
```{r, warning=FALSE}