author | Peng, Meng <peng.meng@intel.com> | 2017-01-10 13:09:58 +0000 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2017-01-10 13:09:58 +0000 |
commit | 32286ba68af03af6b9ff50d5dece050e5417307a (patch) | |
tree | 85d945c4bc531e91ae05bda2c85559660b6d02c8 /python/pyspark/ml/feature.py | |
parent | acfc5f354332107cc744fb636e3730f6fc48b2fe (diff) | |
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for #15212.
## How was this patch tested?
Unit tests.
Author: Peng, Meng <peng.meng@intel.com>
Closes #16434 from mpjlu/fdr_fwe_update.
Diffstat (limited to 'python/pyspark/ml/feature.py')
-rwxr-xr-x | python/pyspark/ml/feature.py | 9 |
1 files changed, 5 insertions, 4 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index dbd17e01d2..ac90c899d9 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
     """
     .. note:: Experimental
 
-    Creates a ChiSquared feature selector.
+    Chi-Squared feature selection, which selects categorical features to use for predicting a
+    categorical label.
     The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
     `fdr`, `fwe`.
 
@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
 
      * `percentile` is similar but chooses a fraction of all features instead of a fixed number.
 
-     * `fpr` chooses all features whose p-value is below a threshold,
+     * `fpr` chooses all features whose p-values are below a threshold,
        thus controlling the false positive rate of selection.
 
      * `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
        False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
        to choose all features whose false discovery rate is below a threshold.
 
-     * `fwe` chooses all features whose p-values is below a threshold,
-       thus controlling the family-wise error rate of selection.
+     * `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
+       1/numFeatures, thus controlling the family-wise error rate of selection.
 
     By default, the selection method is `numTopFeatures`, with the default number of top features
     set to 50.
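For reviewers comparing the three p-value-based methods described in the docstring, the following is a minimal pure-Python sketch of the selection rules as the docstring words them. It is not the Spark implementation: the function names `select_fpr`, `select_fwe` and `select_fdr`, the sample p-values, and the choice of strict versus non-strict comparisons are all assumptions made for illustration.

```python
# Illustrative sketch only; function names and inequality choices are
# assumptions, not taken from the Spark source.

def select_fpr(p_values, alpha):
    """Keep the indices of all features whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


def select_fwe(p_values, alpha):
    """Keep features whose p-value is below alpha scaled by 1/numFeatures."""
    num_features = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / num_features]


def select_fdr(p_values, alpha):
    """Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k such that
    p_(k) <= alpha * k / numFeatures, and keep the k smallest p-values.
    """
    num_features = len(p_values)
    order = sorted(range(num_features), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / num_features:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values for five features.
p_values = [0.2, 0.001, 0.035, 0.5, 0.015]
print(select_fpr(p_values, 0.05))  # [1, 2, 4]
print(select_fwe(p_values, 0.05))  # [1]
print(select_fdr(p_values, 0.05))  # [1, 4]
```

With the same threshold of 0.05, `fwe` is the most conservative of the three on this input, `fpr` the least, and `fdr` falls in between.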