[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change

## What changes were proposed in this pull request?

Add an FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for #15212.

## How was this patch tested?

Unit tests.

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
author: Peng, Meng <peng.meng@intel.com> 2017-01-10 13:09:58 +0000
committer: Sean Owen <sowen@cloudera.com> 2017-01-10 13:09:58 +0000
commit: 32286ba68af03af6b9ff50d5dece050e5417307a (patch)
tree: 85d945c4bc531e91ae05bda2c85559660b6d02c8 /python/pyspark/ml/feature.py
parent: acfc5f354332107cc744fb636e3730f6fc48b2fe (diff)
1 files changed, 5 insertions, 4 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index dbd17e01d2..ac90c899d9 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
     """
     .. note:: Experimental
 
-    Creates a ChiSquared feature selector.
+    Chi-Squared feature selection, which selects categorical features to use for predicting a
+    categorical label.
     The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
     `fdr`, `fwe`.
 
@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
      * `percentile` is similar but chooses a fraction of all features
        instead of a fixed number.
 
-     * `fpr` chooses all features whose p-value is below a threshold,
+     * `fpr` chooses all features whose p-values are below a threshold,
        thus controlling the false positive rate of selection.
 
      * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
        False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
        to choose all features whose false discovery rate is below a threshold.
 
-     * `fwe` chooses all features whose p-values is below a threshold,
-       thus controlling the family-wise error rate of selection.
+     * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+       1/numFeatures, thus controlling the family-wise error rate of selection.
 
     By default, the selection method is `numTopFeatures`, with the default number of top features
     set to 50.
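
For readers comparing the three p-value-based methods described in the docstring above, the following is a minimal pure-Python sketch of how `fpr`, `fdr`, and `fwe` differ. It is not Spark's implementation: the function name `select_features`, its signature, and the exact strictness of each comparison (`<` versus `<=`) are assumptions made for illustration. The only details taken from the docstring are that `fpr` compares p-values to the threshold, `fdr` uses the Benjamini-Hochberg procedure, and `fwe` scales the threshold by 1/numFeatures.

```python
def select_features(p_values, method, threshold):
    """Return the sorted indices of the features kept by `method`.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    n = len(p_values)
    # Pair each feature index with its p-value, smallest p-value first.
    ranked = sorted(enumerate(p_values), key=lambda pair: pair[1])

    if method == "fpr":
        # Keep every feature whose p-value is below the raw threshold.
        chosen = [i for i, p in ranked if p < threshold]
    elif method == "fwe":
        # Same test, but with the threshold scaled by 1/numFeatures.
        chosen = [i for i, p in ranked if p < threshold / n]
    elif method == "fdr":
        # Benjamini-Hochberg: find the largest 1-based rank k such that
        # the k-th smallest p-value is <= threshold * k / n, then keep
        # the k features with the smallest p-values.
        cutoff = 0
        for rank, (_, p) in enumerate(ranked, start=1):
            if p <= threshold * rank / n:
                cutoff = rank
        chosen = [i for i, _ in ranked[:cutoff]]
    else:
        raise ValueError("unsupported method: %r" % method)

    return sorted(chosen)


# Made-up p-values for five features, used only to show the contrast.
example_p_values = [0.001, 0.015, 0.028, 0.045, 0.5]
fpr_selected = select_features(example_p_values, "fpr", 0.05)
fdr_selected = select_features(example_p_values, "fdr", 0.05)
fwe_selected = select_features(example_p_values, "fwe", 0.05)
```

With these made-up p-values and a threshold of 0.05, `fpr` keeps four features, `fdr` keeps three, and `fwe` keeps one, which reflects the increasing strictness of the three criteria.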