author     Shuai Lin <linshuai2012@gmail.com>  2016-09-28 06:12:48 -0400
committer  Sean Owen <sowen@cloudera.com>      2016-09-28 06:12:48 -0400
commit     b2a7eedcddf0e682ff46afd1b764d0b81ccdf395 (patch)
tree       9c31d3a1d2945b3a207a55219c0f8e9591b7b47d
parent     4a83395681e0bca356363a6cfb25c952f235560d (diff)
[SPARK-17017][ML][MLLIB][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector
## What changes were proposed in this pull request?

A follow-up for #14597 to update the feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
-rw-r--r--  docs/ml-features.md               14
-rw-r--r--  docs/mllib-feature-extraction.md  14
2 files changed, 20 insertions, 8 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a39b31c8f7..a7f710fa52 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
## ChiSqSelector
`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+
+By default, the selection method is `KBest` and the default number of top features is 50. The user can call
+`setNumTopFeatures`, `setPercentile` or `setAlpha` to choose among these selection methods.
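As a quick illustration of the API described above, here is a minimal sketch with the DataFrame-based `ChiSqSelector`, assuming an existing `SparkSession` named `spark`; the small dataset and the column names (`features`, `clicked`) are hypothetical:

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// Hypothetical labeled data: categorical features packed into a vector column.
val df = spark.createDataFrame(Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("id", "features", "clicked")

// KBest is the default selection method; keep only the single top feature.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)
result.show()
```

Calling `setPercentile` or `setAlpha` instead of `setNumTopFeatures` switches to the `Percentile` or `FPR` method described above.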
**Examples**
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 353d391249..87e1e027e9 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
both speed and statistical learning behavior.
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
+
+By default, the selection method is `KBest` and the default number of top features is 50. The user can call
+`setNumTopFeatures`, `setPercentile` or `setAlpha` to choose among these selection methods.
The number of features to select can be tuned using a held-out validation set.
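The RDD-based API follows the same idea; a minimal sketch, assuming an existing `SparkContext` named `sc` and a small hypothetical set of `LabeledPoint`s with categorical features:

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical labeled points with categorical features.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(8.0, 7.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 6.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 9.0, 8.0))
))

// KBest with the 2 top features; fit computes the chi-squared statistics.
val selector = new ChiSqSelector(2)
val transformer = selector.fit(data)

// Keep only the selected features in each labeled point.
val filteredData = data.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}
```

The constructor argument (here 2) is the number of top features to keep, which is the quantity to tune on a held-out validation set as noted above.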