2 files changed, +20 -8 lines changed
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 **Examples**
 
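Note on the three selection methods described in this hunk: the sketch below shows one way the selection step could work once a chi-squared test has produced one p-value per feature. It is plain Python, not Spark code. The function names, the sample p-values, ranking by ascending p-value, and reading `FPR` as "p-value below the threshold" are all assumptions made for illustration and are not taken from the patch itself.

```python
from typing import List


def select_k_best(p_values: List[float], k: int) -> List[int]:
    """Return the indices of the k features with the smallest p-values."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:k])


def select_percentile(p_values: List[float], percentile: float) -> List[int]:
    """Like select_k_best, but k is a fraction of the feature count."""
    k = int(len(p_values) * percentile)  # truncation is an assumption
    return select_k_best(p_values, k)


def select_fpr(p_values: List[float], alpha: float) -> List[int]:
    """Keep every feature whose p-value is below alpha (assumed reading of FPR)."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Made-up p-values for five features, purely illustrative.
p_values = [0.30, 0.001, 0.04, 0.80, 0.02]

print(select_k_best(p_values, 2))
print(select_percentile(p_values, 0.5))
print(select_fpr(p_values, 0.05))
```

The fixed-count and fraction-based methods always return a predictable number of features, while the threshold-based method can return anything from none to all of them depending on the data.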
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
 both speed and statistical learning behavior.
 
 [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 The number of features to select can be tuned using a held-out validation set.
 
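Note on the Chi-Squared test of independence that this hunk links to: as a rough illustration, the sketch below computes the Pearson chi-squared statistic for one categorical feature against a class label from a contingency table of counts. It is plain Python rather than Spark code. The function names and the sample table are hypothetical, the p-value helper only covers the 2x2 case (one degree of freedom), and every expected count is assumed to be non-zero.

```python
import math
from typing import List


def chi_squared_statistic(table: List[List[float]]) -> float:
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and label.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_one_dof(stat: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows are feature values, columns are class labels.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

print(chi_squared_statistic(dependent), p_value_one_dof(chi_squared_statistic(dependent)))
print(chi_squared_statistic(independent), p_value_one_dof(chi_squared_statistic(independent)))
```

A feature whose counts match the independence expectation gets a statistic of zero and a p-value of one, so it would rank last under any of the selection methods.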