From 78627425708a0afbe113efdf449e8622b43b652d Mon Sep 17 00:00:00 2001
From: "Joseph K. Bradley"
Date: Wed, 14 Dec 2016 14:10:40 -0800
Subject: [PATCH] [SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes

## What changes were proposed in this pull request?

Added short section for KSTest.
Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849)

![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png)

## How was this patch tested?

Manually tested example locally.
Built vignettes locally.

Author: Joseph K. Bradley

Closes #16283 from jkbradley/ksTest-vignette.
---
 R/pkg/vignettes/sparkr-vignettes.Rmd | 29 +++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 334daa51f019d..d507e2cdf941b 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -469,6 +469,10 @@ SparkR supports the following machine learning models and algorithms.
 
 * Isotonic Regression Model
 
+* Logistic Regression Model
+
+* Kolmogorov-Smirnov Test
+
 More will be added in the future.
 
 ### R Formula
@@ -800,7 +804,7 @@ newDF <- createDataFrame(data.frame(x = c(1.5, 3.2)))
 head(predict(isoregModel, newDF))
 ```
 
-### Logistic Regression Model
+#### Logistic Regression Model
 
 (Added in 2.1.0)
 
@@ -834,6 +838,29 @@ model <- spark.logit(df, Species ~ ., regParam = 0.5)
 summary(model)
 ```
 
+#### Kolmogorov-Smirnov Test
+
+`spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
+Given a `SparkDataFrame`, the test compares continuous data in a given column `testCol` with the theoretical distribution
+specified by parameter `nullHypothesis`.
+Users can call `summary` to get a summary of the test results.
+
+In the following example, we test whether the `longley` dataset's `Armed_Forces` column
+follows a normal distribution. We set the parameters of the normal distribution using
+the mean and standard deviation of the sample.
+
+```{r, warning=FALSE}
+df <- createDataFrame(longley)
+afStats <- head(select(df, mean(df$Armed_Forces), sd(df$Armed_Forces)))
+afMean <- afStats[1]
+afStd <- afStats[2]
+
+test <- spark.kstest(df, "Armed_Forces", "norm", c(afMean, afStd))
+testSummary <- summary(test)
+testSummary
+```
+
+
 ### Model Persistence
 The following example shows how to save/load an ML model by SparkR.
 ```{r, warning=FALSE}