Commit 32286ba

Peng, Meng authored and srowen committed Jan 10, 2017
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?

Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for apache#15212.

## How was this patch tested?

Unit tests.

Author: Peng, Meng <[email protected]>

Closes apache#16434 from mpjlu/fdr_fwe_update.
1 parent acfc5f3 commit 32286ba

7 files changed: +96 -34 lines changed

‎docs/ml-features.md

+2 -2
@@ -1426,9 +1426,9 @@ categorical features. ChiSqSelector uses the
 features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
-* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
 * `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
-* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
+* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
 By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
 The user can choose a selection method using `setSelectorType`.
 
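
The three threshold-based rules described in this documentation hunk (`fpr`, `fdr`, `fwe`) can be sketched in plain Python. This is a minimal illustration and not Spark's implementation: the function names are hypothetical, the choice of strict `<` versus `<=` comparisons is an assumption, and the inputs are the rounded p-values quoted in the ChiSqSelectorSuite comment of this commit.

```python
from typing import List


def select_fpr(p_values: List[float], alpha: float) -> List[int]:
    """Keep every feature whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


def select_fwe(p_values: List[float], alpha: float) -> List[int]:
    """Scale the threshold by 1/numFeatures, then keep p-values below it."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fdr(p_values: List[float], alpha: float) -> List[int]:
    """Benjamini-Hochberg step-up: find the largest rank k such that
    p_(k) <= alpha * k / m, then keep the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Rounded p-values for feature1..feature6, taken from the test-suite comment.
p_values = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]

print(select_fpr(p_values, 0.02))
print(select_fdr(p_values, 0.12))
print(select_fwe(p_values, 0.12))
```

With the thresholds used in the updated tests (`fpr` 0.02, `fdr` 0.12, `fwe` 0.12), each of the three rules in this sketch keeps only the first feature (index 0).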

‎docs/mllib-feature-extraction.md

+2 -2
@@ -231,9 +231,9 @@ features to choose. It supports five selection methods: `numTopFeatures`, `perce
 
 * `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
-* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
 * `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
-* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
+* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
 
 By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
 The user can choose a selection method using `setSelectorType`.

‎mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

+3 -3
@@ -143,13 +143,13 @@ private[feature] trait ChiSqSelectorParams extends Params
  * `fdr`, `fwe`.
  * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
  * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
+ * - `fpr` chooses all features whose p-value are below a threshold, thus controlling the false
  * positive rate of selection.
  * - `fdr` uses the [Benjamini-Hochberg procedure]
  * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
  * to choose all features whose false discovery rate is below a threshold.
- * - `fwe` chooses all features whose p-values is below a threshold,
- * thus controlling the family-wise error rate of selection.
+ * - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ * 1/numFeatures, thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default number of top features
  * set to 50.
  */

‎mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

+3 -3
@@ -175,13 +175,13 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
  * `fdr`, `fwe`.
  * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
  * - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
- * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
+ * - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
  * positive rate of selection.
  * - `fdr` uses the [Benjamini-Hochberg procedure]
  * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
  * to choose all features whose false discovery rate is below a threshold.
- * - `fwe` chooses all features whose p-values is below a threshold,
- * thus controlling the family-wise error rate of selection.
+ * - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ * 1/numFeatures, thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default number of top features
  * set to 50.
  */

‎mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala

+78 -17
@@ -35,22 +35,77 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
 
     // Toy dataset, including the top feature for a chi-squared test.
     // These data are chosen such that each feature's test has a distinct p-value.
-    /* To verify the results with R, run:
-      library(stats)
-      x1 <- c(8.0, 0.0, 0.0, 7.0, 8.0)
-      x2 <- c(7.0, 9.0, 9.0, 9.0, 7.0)
-      x3 <- c(0.0, 6.0, 8.0, 5.0, 3.0)
-      y <- c(0.0, 1.0, 1.0, 2.0, 2.0)
-      chisq.test(x1,y)
-      chisq.test(x2,y)
-      chisq.test(x3,y)
+    /*
+     * Contingency tables
+     * feature1 = {6.0, 0.0, 8.0}
+     * class 0 1 2
+     * 6.0||1|0|0|
+     * 0.0||0|3|0|
+     * 8.0||0|0|2|
+     * degree of freedom = 4, statistic = 12, pValue = 0.017
+     *
+     * feature2 = {7.0, 9.0}
+     * class 0 1 2
+     * 7.0||1|0|0|
+     * 9.0||0|3|2|
+     * degree of freedom = 2, statistic = 6, pValue = 0.049
+     *
+     * feature3 = {0.0, 6.0, 3.0, 8.0}
+     * class 0 1 2
+     * 0.0||1|0|0|
+     * 6.0||0|1|2|
+     * 3.0||0|1|0|
+     * 8.0||0|1|0|
+     * degree of freedom = 6, statistic = 8.66, pValue = 0.193
+     *
+     * feature4 = {7.0, 0.0, 5.0, 4.0}
+     * class 0 1 2
+     * 7.0||1|0|0|
+     * 0.0||0|2|0|
+     * 5.0||0|1|1|
+     * 4.0||0|0|1|
+     * degree of freedom = 6, statistic = 9.5, pValue = 0.147
+     *
+     * feature5 = {6.0, 5.0, 4.0, 0.0}
+     * class 0 1 2
+     * 6.0||1|1|0|
+     * 5.0||0|2|0|
+     * 4.0||0|0|1|
+     * 0.0||0|0|1|
+     * degree of freedom = 6, statistic = 8.0, pValue = 0.238
+     *
+     * feature6 = {0.0, 9.0, 5.0, 4.0}
+     * class 0 1 2
+     * 0.0||1|0|1|
+     * 9.0||0|1|0|
+     * 5.0||0|1|0|
+     * 4.0||0|1|1|
+     * degree of freedom = 6, statistic = 5, pValue = 0.54
+     *
+     * To verify the results with R, run:
+     * library(stats)
+     * x1 <- c(6.0, 0.0, 0.0, 0.0, 8.0, 8.0)
+     * x2 <- c(7.0, 9.0, 9.0, 9.0, 9.0, 9.0)
+     * x3 <- c(0.0, 6.0, 3.0, 8.0, 6.0, 6.0)
+     * x4 <- c(7.0, 0.0, 0.0, 5.0, 5.0, 4.0)
+     * x5 <- c(6.0, 5.0, 5.0, 6.0, 4.0, 0.0)
+     * x6 <- c(0.0, 9.0, 5.0, 4.0, 4.0, 0.0)
+     * y <- c(0.0, 1.0, 1.0, 1.0, 2.0, 2.0)
+     * chisq.test(x1,y)
+     * chisq.test(x2,y)
+     * chisq.test(x3,y)
+     * chisq.test(x4,y)
+     * chisq.test(x5,y)
+     * chisq.test(x6,y)
      */
+
     dataset = spark.createDataFrame(Seq(
-      (0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0))), Vectors.dense(8.0)),
-      (1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0))), Vectors.dense(0.0)),
-      (1.0, Vectors.dense(Array(0.0, 9.0, 8.0)), Vectors.dense(0.0)),
-      (2.0, Vectors.dense(Array(7.0, 9.0, 5.0)), Vectors.dense(7.0)),
-      (2.0, Vectors.dense(Array(8.0, 7.0, 3.0)), Vectors.dense(8.0))
+      (0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0))), Vectors.dense(6.0)),
+      (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0))), Vectors.dense(0.0)),
+      (1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0))), Vectors.dense(0.0)),
+      (1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), Vectors.dense(0.0)),
+      (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), Vectors.dense(8.0)),
+      (2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)), Vectors.dense(8.0))
     )).toDF("label", "features", "topFeature")
   }
 
@@ -69,19 +124,25 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
 
   test("Test Chi-Square selector: percentile") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.34)
+      .setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.17)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }
 
   test("Test Chi-Square selector: fpr") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("fpr").setFpr(0.2)
+      .setOutputCol("filtered").setSelectorType("fpr").setFpr(0.02)
+    ChiSqSelectorSuite.testSelector(selector, dataset)
+  }
+
+  test("Test Chi-Square selector: fdr") {
+    val selector = new ChiSqSelector()
+      .setOutputCol("filtered").setSelectorType("fdr").setFdr(0.12)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }
 
   test("Test Chi-Square selector: fwe") {
     val selector = new ChiSqSelector()
-      .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
+      .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.12)
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }
 
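
The statistics quoted in the contingency-table comment of this test file can be rechecked without R. The sketch below is a plain-Python illustration, not code from Spark or from this commit: `chi_squared` and `p_value_even_dof` are hypothetical helper names, the table is assumed to have no all-zero row or column, and the tail-probability formula is the closed form that holds only for an even number of degrees of freedom.

```python
import math
from typing import List, Tuple


def chi_squared(table: List[List[int]]) -> Tuple[float, int]:
    """Return (Pearson statistic, degrees of freedom) for a contingency table.

    Assumes every row and column has a non-zero total.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, dof


def p_value_even_dof(stat: float, dof: int) -> float:
    """Upper-tail chi-squared probability; valid only for even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = stat / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * terms


# feature1 table from the comment: rows are the values 6.0, 0.0, 8.0;
# columns are the classes 0, 1, 2.
feature1 = [[1, 0, 0], [0, 3, 0], [0, 0, 2]]
stat, dof = chi_squared(feature1)
print(stat, dof, round(p_value_even_dof(stat, dof), 3))
```

For feature1 this yields a statistic of about 12 with 4 degrees of freedom and a p-value of about 0.017, in line with the comment. The even-dof restriction is acceptable here only because all six tables in the comment have 2, 4, or 6 degrees of freedom.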

‎python/pyspark/ml/feature.py

+5 -4
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
     """
     .. note:: Experimental
 
-    Creates a ChiSquared feature selector.
+    Chi-Squared feature selection, which selects categorical features to use for predicting a
+    categorical label.
     The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
     `fdr`, `fwe`.
 
@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
     * `percentile` is similar but chooses a fraction of all features
       instead of a fixed number.
 
-    * `fpr` chooses all features whose p-value is below a threshold,
+    * `fpr` chooses all features whose p-values are below a threshold,
       thus controlling the false positive rate of selection.
 
     * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
       False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
       to choose all features whose false discovery rate is below a threshold.
 
-    * `fwe` chooses all features whose p-values is below a threshold,
-      thus controlling the family-wise error rate of selection.
+    * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+      1/numFeatures, thus controlling the family-wise error rate of selection.
 
     By default, the selection method is `numTopFeatures`, with the default number of top features
     set to 50.

‎python/pyspark/mllib/feature.py

+3 -3
@@ -282,15 +282,15 @@ class ChiSqSelector(object):
     * `percentile` is similar but chooses a fraction of all features
       instead of a fixed number.
 
-    * `fpr` chooses all features whose p-value is below a threshold,
+    * `fpr` chooses all features whose p-values are below a threshold,
       thus controlling the false positive rate of selection.
 
     * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
       False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
       to choose all features whose false discovery rate is below a threshold.
 
-    * `fwe` chooses all features whose p-values is below a threshold,
-      thus controlling the family-wise error rate of selection.
+    * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+      1/numFeatures, thus controlling the family-wise error rate of selection.
 
     By default, the selection method is `numTopFeatures`, with the default number of top features
     set to 50.
