Skip to content

Commit

Permalink
[SPARK-2830][MLLIB] doc update for 1.1
Browse files Browse the repository at this point in the history
1. renamed mllib-basics to mllib-data-types
1. renamed mllib-stats to mllib-statistics
1. moved random data generation to the bottom of mllib-stats
1. updated toc accordingly

atalwalkar

Author: Xiangrui Meng <[email protected]>

Closes apache#2151 from mengxr/mllib-doc-1.1 and squashes the following commits:

0bd79f3 [Xiangrui Meng] add mllib-data-types
b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md
  • Loading branch information
mengxr committed Aug 27, 2014
1 parent e1139dd commit 43dfc84
Show file tree
Hide file tree
Showing 4 changed files with 87 additions and 86 deletions.
4 changes: 2 additions & 2 deletions docs/mllib-basics.md → docs/mllib-data-types.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: global
title: Basics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
title: Data Types - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Data Types
---

* Table of contents
Expand Down
4 changes: 2 additions & 2 deletions docs/mllib-dimensionality-reduction.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
of reducing the number of variables under consideration.
It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
MLlib provides support for dimensionality reduction on the <a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
MLlib provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.

## Singular value decomposition (SVD)

Expand Down Expand Up @@ -58,7 +58,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
### SVD Example

MLlib provides SVD functionality to row-oriented matrices, provided in the
<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.

<div class="codetabs">
<div data-lang="scala" markdown="1">
Expand Down
9 changes: 5 additions & 4 deletions docs/mllib-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,13 @@ MLlib is Spark's scalable machine learning library consisting of common learning
including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:

* [Data types](mllib-basics.html)
* [Basic statistics](mllib-stats.html)
* random data generation
* stratified sampling
* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
* summary statistics
* correlations
* stratified sampling
* hypothesis testing
* random data generation
* [Classification and regression](mllib-classification-regression.html)
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
* [decision trees](mllib-decision-tree.html)
Expand Down
156 changes: 78 additions & 78 deletions docs/mllib-stats.md → docs/mllib-statistics.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
layout: global
title: Statistics Functionality - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
title: Basic Statistics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics
---

* Table of contents
Expand All @@ -25,7 +25,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
\newcommand{\zero}{\mathbf{0}}
\]`

## Summary Statistics
## Summary statistics

We provide column summary statistics for `RDD[Vector]` through the function `colStats`
available in `Statistics`.
Expand Down Expand Up @@ -104,81 +104,7 @@ print summary.numNonzeros()

</div>

## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight java %}
import org.apache.spark.SparkContext;
import org.apache.spark.api.JavaDoubleRDD;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
JavaDoubleRDD v = u.map(
new Function<Double, Double>() {
public Double call(Double x) {
return 1.0 + 2.0 * x;
}
});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight python %}
from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>

</div>

## Correlations calculation
## Correlations

Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
we provide the flexibility to calculate pairwise correlations among many series. The supported
Expand Down Expand Up @@ -455,3 +381,77 @@ for (ChiSqTestResult result : featureTestResults) {
</div>

</div>

## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight java %}
import org.apache.spark.SparkContext;
import org.apache.spark.api.JavaDoubleRDD;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
JavaDoubleRDD v = u.map(
new Function<Double, Double>() {
public Double call(Double x) {
return 1.0 + 2.0 * x;
}
});
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follows the standard normal
distribution `N(0, 1)`, and then map it to `N(1, 4)`.

{% highlight python %}
from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>

</div>

0 comments on commit 43dfc84

Please sign in to comment.