author Peng, Meng <peng.meng@intel.com> 2017-01-10 13:09:58 +0000
committer Sean Owen <sowen@cloudera.com> 2017-01-10 13:09:58 +0000
commit 32286ba68af03af6b9ff50d5dece050e5417307a
tree 85d945c4bc531e91ae05bda2c85559660b6d02c8 /python
parent acfc5f354332107cc744fb636e3730f6fc48b2fe
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?

Add an FDR test case in ml/feature/ChiSqSelectorSuite. Improve some comments in the code. This is a follow-up PR for #15212.

## How was this patch tested?

Unit tests.

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
Diffstat (limited to 'python')
-rwxr-xr-x python/pyspark/ml/feature.py 9
-rw-r--r-- python/pyspark/mllib/feature.py 6
2 files changed, 8 insertions, 7 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index dbd17e01d2..ac90c899d9 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"""
.. note:: Experimental
- Creates a ChiSquared feature selector.
+ Chi-Squared feature selection, which selects categorical features to use for predicting a
+ categorical label.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
* `percentile` is similar but chooses a fraction of all features
instead of a fixed number.
- * `fpr` chooses all features whose p-value is below a threshold,
+ * `fpr` chooses all features whose p-values are below a threshold,
thus controlling the false positive rate of selection.
* `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
to choose all features whose false discovery rate is below a threshold.
- * `fwe` chooses all features whose p-values is below a threshold,
- thus controlling the family-wise error rate of selection.
+ * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is `numTopFeatures`, with the default number of top features
set to 50.
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index 61f2bc7492..e5231dc3a2 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -282,15 +282,15 @@ class ChiSqSelector(object):
* `percentile` is similar but chooses a fraction of all features
instead of a fixed number.
- * `fpr` chooses all features whose p-value is below a threshold,
+ * `fpr` chooses all features whose p-values are below a threshold,
thus controlling the false positive rate of selection.
* `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
to choose all features whose false discovery rate is below a threshold.
- * `fwe` chooses all features whose p-values is below a threshold,
- thus controlling the family-wise error rate of selection.
+ * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+ 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is `numTopFeatures`, with the default number of top features
set to 50.
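
The difference between the `fpr`, `fdr`, and `fwe` rules described in the docstrings above can be illustrated with a minimal pure-Python sketch. This is not Spark's implementation: the function name `select_features`, its arguments, and the choice of strict versus non-strict comparisons are assumptions made for illustration only. It takes per-feature p-values (which the selector would obtain from chi-squared tests) and applies each rule as the docstrings describe it.

```python
def select_features(p_values, method, threshold):
    """Return the sorted indices of the features kept by a selection rule.

    Hypothetical helper for illustration; not part of pyspark.
    """
    n = len(p_values)
    if method == "fpr":
        # Keep every feature whose p-value is below the threshold.
        return [i for i, p in enumerate(p_values) if p < threshold]
    if method == "fwe":
        # Same test, but the threshold is scaled by 1/numFeatures.
        return [i for i, p in enumerate(p_values) if p < threshold / n]
    if method == "fdr":
        # Benjamini-Hochberg: sort p-values ascending, find the largest
        # 1-based rank k with p_(k) <= threshold * k / n, keep the k smallest.
        order = sorted(range(n), key=lambda i: p_values[i])
        cutoff = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= threshold * rank / n:
                cutoff = rank
        return sorted(order[:cutoff])
    raise ValueError("unknown selection method: %s" % method)


# Made-up p-values for five features, for illustration only.
p_values = [0.001, 0.012, 0.025, 0.045, 0.2]
kept_fpr = select_features(p_values, "fpr", 0.05)
kept_fdr = select_features(p_values, "fdr", 0.05)
kept_fwe = select_features(p_values, "fwe", 0.05)
```

With the same threshold of 0.05, `fpr` keeps the most features, `fdr` keeps fewer, and `fwe` is the strictest because its effective threshold here is 0.05/5 = 0.01.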