author | Shuai Lin <linshuai2012@gmail.com> | 2016-09-28 06:12:48 -0400 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-09-28 06:12:48 -0400 |
commit | b2a7eedcddf0e682ff46afd1b764d0b81ccdf395 (patch) | |
tree | 9c31d3a1d2945b3a207a55219c0f8e9591b7b47d /docs/ml-features.md | |
parent | 4a83395681e0bca356363a6cfb25c952f235560d (diff) | |
[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector
## What changes were proposed in this pull request?
A follow-up to #14597 that updates the feature selection docs for ChiSqSelector.
## How was this patch tested?
Generated the HTML docs, which can be previewed at:
* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector
Author: Shuai Lin <linshuai2012@gmail.com>
Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- | docs/ml-features.md | 14 |
1 files changed, 10 insertions, 4 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a39b31c8f7..a7f710fa52 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 **Examples**
 
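The three selection methods described in the added docs can be illustrated with a small plain-Python sketch. This is not Spark's implementation and does not call the Spark API. The function names and the p-values are hypothetical, and the sketch assumes that features are ranked by ascending chi-squared p-value and that "meets some threshold" for `FPR` means a p-value strictly below `alpha`.

```python
from typing import List, Sequence


def select_k_best(p_values: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k features with the smallest p-values."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:k])


def select_percentile(p_values: Sequence[float], percentile: float) -> List[int]:
    """Like select_k_best, but k is a fraction of the total feature count."""
    k = int(len(p_values) * percentile)
    return select_k_best(p_values, k)


def select_fpr(p_values: Sequence[float], alpha: float) -> List[int]:
    """Return every feature whose p-value is strictly below alpha (assumed rule)."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Hypothetical p-values, one per feature, from a chi-squared test against the label.
p_values = [0.30, 0.001, 0.04, 0.75, 0.02, 0.03]

k_best = select_k_best(p_values, k=2)                       # fixed number of features
top_fraction = select_percentile(p_values, percentile=0.5)  # fixed fraction of features
below_alpha = select_fpr(p_values, alpha=0.05)              # variable number of features
```

With these inputs the three methods keep 2, 3 and 4 features respectively. `KBest` and `Percentile` fix the size of the result in advance, while `FPR` lets the threshold decide how many features survive.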