
Commit b2a7eed

lins05 authored and srowen committed Sep 28, 2016
[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector
## What changes were proposed in this pull request?

A follow up for apache#14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <[email protected]>

Closes apache#15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
1 parent 4a83395 commit b2a7eed

2 files changed: +20 -8 lines changed


‎docs/ml-features.md

+10 -4
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector

 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.

 **Examples**
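The three selection methods described in the added lines can be sketched in plain Python. This is an illustrative sketch only, not Spark's implementation: it assumes each feature already has a p-value from a chi-squared test, that "top" means smallest p-value, and that `FPR` keeps features whose p-value falls below the threshold. The function names and sample p-values are hypothetical.

```python
from typing import List


def select_k_best(p_values: List[float], k: int) -> List[int]:
    """Return indices of the k features with the smallest p-values."""
    # Rank feature indices by p-value (stable sort: ties keep index order).
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:k])


def select_percentile(p_values: List[float], fraction: float) -> List[int]:
    """Like select_k_best, but k is a fraction of the feature count."""
    k = int(len(p_values) * fraction)
    return select_k_best(p_values, k)


def select_fpr(p_values: List[float], alpha: float) -> List[int]:
    """Return indices of all features whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Hypothetical p-values for six features (made up for illustration).
p_values = [0.30, 0.001, 0.04, 0.75, 0.02, 0.045]

print(select_k_best(p_values, 2))        # [1, 4]
print(select_percentile(p_values, 0.5))  # [1, 2, 4]
print(select_fpr(p_values, 0.05))        # [1, 2, 4, 5]
```

Note that `KBest` and `Percentile` always return a predictable number of features, while the size of the `FPR` result depends on the data.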

‎docs/mllib-feature-extraction.md

+10 -4
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
 both speed and statistical learning behavior.

 [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.

 The number of features to select can be tuned using a held-out validation set.
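For readers unfamiliar with the chi-squared test of independence that the updated text refers to, the statistic for one categorical feature against the class label can be sketched as follows. This is a minimal pure-Python sketch, not the code Spark runs: the function name and the toy data are hypothetical, and it stops at the test statistic without converting it to a p-value.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_squared_statistic(feature: Sequence[Hashable],
                          labels: Sequence[Hashable]) -> float:
    """Pearson chi-squared statistic for a feature/label contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, labels))  # joint counts per (value, label)
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals

    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


# Toy data (made up): one feature that tracks the label, one that does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [0, 1, 0, 1, 0, 1, 0, 1]

print(chi_squared_statistic(informative, labels))  # 8.0
print(chi_squared_statistic(noise, labels))        # 0.0
```

A larger statistic indicates stronger evidence that the label depends on the feature. Turning it into a p-value requires the chi-squared distribution with (rows - 1) × (columns - 1) degrees of freedom, which is left out here to keep the sketch dependency-free.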
