[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups
Various ML guide cleanups.

* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically.  E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.

CC: mengxr feynmanliang

Author: Joseph K. Bradley <[email protected]>

Closes #8752 from jkbradley/mlguide-fixes-1.5.
jkbradley authored and mengxr committed Sep 16, 2015
1 parent 64c29af commit b921fe4
Showing 5 changed files with 91 additions and 35 deletions.
34 changes: 30 additions & 4 deletions docs/ml-features.md
@@ -123,12 +123,21 @@ for features_label in rescaledData.select("features", "label").take(3):

## Word2Vec

`Word2Vec` is an `Estimator` which takes sequences of words that represents documents and trains a `Word2VecModel`. The model is a `Map(String, Vector)` essentially, which maps each word to an unique fix-sized vector. The `Word2VecModel` transforms each documents into a vector using the average of all words in the document, which aims to other computations of documents such as similarity calculation consequencely. Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#Word2Vec) for more details on Word2Vec.
`Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
`Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
transforms each document into a vector using the average of all words in the document; this vector
can then be used as features for prediction, document similarity calculations, etc.
Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#Word2Vec) for more
details.

Word2Vec is implemented in [Word2Vec](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec). In the following code segment, we start with a set of documents, each of them is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
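
As a minimal Scala sketch of this flow (assuming a `sqlContext` is in scope; the tabbed examples below give the complete versions):

{% highlight scala %}
import org.apache.spark.ml.feature.Word2Vec

// Each row holds one document as a sequence of words.
val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Fit a Word2VecModel, then map each document to the average of its word vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
{% endhighlight %}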

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Word2Vec Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec)
for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.Word2Vec

@@ -152,6 +161,10 @@ result.select("result").take(3).foreach(println)
</div>

<div data-lang="java" markdown="1">

Refer to the [Word2Vec Java docs](api/java/org/apache/spark/ml/feature/Word2Vec.html)
for more details on the API.

{% highlight java %}
import java.util.Arrays;

@@ -192,6 +205,10 @@ for (Row r: result.select("result").take(3)) {
</div>

<div data-lang="python" markdown="1">

Refer to the [Word2Vec Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Word2Vec)
for more details on the API.

{% highlight python %}
from pyspark.ml.feature import Word2Vec

@@ -621,12 +638,15 @@ for ngrams_label in ngramDataFrame.select("ngrams", "label").take(3):

## Binarizer

Binarization is the process of thresholding numerical features to binary features. As some probabilistic estimators make assumption that the input data is distributed according to [Bernoulli distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution), a binarizer is useful for pre-processing the input data with continuous numerical features.
Binarization is the process of thresholding numerical features to binary (0/1) features.

A simple [Binarizer](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) class provides this functionality. Besides the common parameters of `inputCol` and `outputCol`, `Binarizer` has the parameter `threshold` used for binarizing continuous numerical features. The features greater than the threshold, will be binarized to 1.0. The features equal to or less than the threshold, will be binarized to 0.0. The example below shows how to binarize numerical features.
`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
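
A condensed Scala sketch of the thresholding behavior described above (assuming a `sqlContext`; the full tabbed examples follow):

{% highlight scala %}
import org.apache.spark.ml.feature.Binarizer

val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")

// threshold = 0.5: values > 0.5 become 1.0, values <= 0.5 become 0.0
val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame)
{% endhighlight %}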

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Binarizer API doc](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) for more details.

{% highlight scala %}
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.DataFrame
@@ -650,6 +670,9 @@ binarizedFeatures.collect().foreach(println)
</div>

<div data-lang="java" markdown="1">

Refer to the [Binarizer API doc](api/java/org/apache/spark/ml/feature/Binarizer.html) for more details.

{% highlight java %}
import java.util.Arrays;

@@ -687,6 +710,9 @@ for (Row r : binarizedFeatures.collect()) {
</div>

<div data-lang="python" markdown="1">

Refer to the [Binarizer API doc](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer) for more details.

{% highlight python %}
from pyspark.ml.feature import Binarizer

31 changes: 20 additions & 11 deletions docs/ml-guide.md
@@ -32,7 +32,21 @@ See the [algorithm guides](#algorithm-guides) section below for guides on sub-pa
* This will become a table of contents (this text will be scraped).
{:toc}

# Main concepts
# Algorithm guides

We provide several algorithm guides specific to the Pipelines API.
Several of these algorithms, such as certain feature transformers, are not in the `spark.mllib` API.
Also, some algorithms have additional capabilities in the `spark.ml` API; e.g., random forests
provide class probabilities, and linear models provide model summaries.

* [Feature extraction, transformation, and selection](ml-features.html)
* [Decision Trees for classification and regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)


# Main concepts in Pipelines

Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
algorithms into a single pipeline, or workflow.
@@ -166,6 +180,11 @@ compile-time type checking.
`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.

*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
can be put into the same `Pipeline` since different instances will be created with different IDs.
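
For example (a short illustrative sketch, assuming a `words` input column):

{% highlight scala %}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.HashingTF

val myHashingTF1 = new HashingTF().setInputCol("words").setOutputCol("tf1")
val myHashingTF2 = new HashingTF().setInputCol("words").setOutputCol("tf2")

// OK: two distinct HashingTF instances, so the stage IDs are unique.
val okPipeline = new Pipeline().setStages(Array(myHashingTF1, myHashingTF2))

// Not OK: the same instance appears twice, which would duplicate a stage ID.
// val badPipeline = new Pipeline().setStages(Array(myHashingTF1, myHashingTF1))
{% endhighlight %}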

## Parameters

Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
@@ -184,16 +203,6 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
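
In Scala, that construction looks like this (a minimal sketch):

{% highlight scala %}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr1 = new LogisticRegression()
val lr2 = new LogisticRegression()

// One ParamMap may hold parameters for several instances at once;
// each pair is keyed by the specific instance's parameter.
val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)
{% endhighlight %}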

# Algorithm guides

There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.

* [Feature extraction, transformation, and selection](ml-features.html)
* [Decision Trees for classification and regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)

# Code examples

This section gives code examples illustrating the functionality discussed above.
4 changes: 4 additions & 0 deletions docs/mllib-clustering.md
@@ -507,6 +507,10 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: The maximum number of EM iterations.

*Note*: It is important to do enough iterations. In early iterations, EM often has useless topics,
but those topics improve dramatically after more iterations. Using at least 20 and possibly
50-100 iterations is often reasonable, depending on your dataset.
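
For instance, a minimal sketch of configuring the iteration count (assuming `corpus: RDD[(Long, Vector)]` of document IDs and term-count vectors):

{% highlight scala %}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

val lda = new LDA()
  .setK(20)
  .setOptimizer("em")
  .setMaxIterations(50)  // too few EM iterations often yields poor topics

val ldaModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]
{% endhighlight %}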

`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
53 changes: 35 additions & 18 deletions docs/mllib-feature-extraction.md
@@ -380,35 +380,43 @@ data2 = labels.zip(normalizer2.transform(features))
</div>
</div>

## Feature selection
[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
## ChiSqSelector

### ChiSqSelector
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power.
[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant
features for use in model construction. It reduces the size of the feature space, which can improve
both speed and statistical learning behavior.

#### Model Fitting
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
Chi-Squared feature selection. It operates on labeled data with categorical features.
`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
and then filters (selects) the top features which the class label depends on the most.
This is akin to yielding the features with the most predictive power.

[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
following parameters in the constructor:
The number of features to select can be tuned using a held-out validation set.

* `numTopFeatures` number of top features that the selector will select (filter).
### Model Fitting

We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
return a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
`ChiSqSelector` takes a `numTopFeatures` parameter specifying the number of top features that
the selector will select.

This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
The [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method takes
an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to
an `RDD[Vector]` to produce a reduced `RDD[Vector]`.

Note that the user can also construct a `ChiSqSelectorModel` by hand by providing an array of selected feature indices (which must be sorted in ascending order).
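
A brief sketch of both routes (assuming `discretizedData: RDD[LabeledPoint]` with categorical features; a fuller example follows below):

{% highlight scala %}
import org.apache.spark.mllib.feature.{ChiSqSelector, ChiSqSelectorModel}

// Fit a selector that keeps the 50 most class-dependent features.
val selector = new ChiSqSelector(numTopFeatures = 50)
val selectorModel = selector.fit(discretizedData)
val reducedData = discretizedData.map(lp => selectorModel.transform(lp.features))

// Or construct a model directly from sorted feature indices.
val manualModel = new ChiSqSelectorModel(Array(1, 3, 7))
{% endhighlight %}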

#### Example
### Example

The following example shows the basic use of ChiSqSelector. The data set used has a feature matrix consisting of greyscale values that vary from 0 to 255 for each feature.

<div class="codetabs">
<div data-lang="scala">
<div data-lang="scala" markdown="1">

Refer to the [`ChiSqSelector` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
for details on the API.

{% highlight scala %}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
@@ -434,7 +442,11 @@ val filteredData = discretizedData.map { lp =>
{% endhighlight %}
</div>

<div data-lang="java">
<div data-lang="java" markdown="1">

Refer to the [`ChiSqSelector` Java docs](api/java/org/apache/spark/mllib/feature/ChiSqSelector.html)
for details on the API.

{% highlight java %}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
@@ -486,7 +498,12 @@ sc.stop();

## ElementwiseProduct

ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `w`, to yield a result vector.
`ElementwiseProduct` multiplies each input vector by a provided "weight" vector, using element-wise
multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This
represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29)
between the input vector, `v` and transforming vector, `scalingVec`, to yield a result vector.
Denoting the `scalingVec` as "`w`," this transformation may be written as:

`\[ \begin{pmatrix}
v_1 \\
Expand All @@ -506,7 +523,7 @@ v_N

[`ElementwiseProduct`](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) has the following parameter in the constructor:

* `w`: the transforming vector.
* `scalingVec`: the transforming vector.

`ElementwiseProduct` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
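
A minimal Scala sketch (assuming `data: RDD[Vector]`):

{% highlight scala %}
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// The transforming vector ("w" above) is passed as scalingVec.
val scalingVec = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct(scalingVec)

// Hadamard product of every vector in the RDD with scalingVec.
val transformedData = transformer.transform(data)
{% endhighlight %}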

4 changes: 2 additions & 2 deletions docs/mllib-guide.md
@@ -13,9 +13,9 @@ primitives and higher-level pipeline APIs.

It divides into two packages:

* [`spark.mllib`](mllib-guide.html#mllib-types-algorithms-and-utilities) contains the original API
* [`spark.mllib`](mllib-guide.html#data-types-algorithms-and-utilities) contains the original API
built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
* [`spark.ml`](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) provides higher-level API
* [`spark.ml`](ml-guide.html) provides higher-level API
built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.

Using `spark.ml` is recommended because with DataFrames the API is more versatile and flexible.
