author | Joseph K. Bradley <joseph@databricks.com> | 2016-11-01 17:00:00 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2016-11-01 17:00:00 -0700 |
commit | 91c33a0ca5c8287f710076ed7681e5aa13ca068f (patch) | |
tree | ea3e24b067e3b7ba1f340f0ed7906c80a64a36bd /docs/ml-features.md | |
parent | b929537b6eb0f8f34497c3dbceea8045bf5dffdb (diff) | |
[SPARK-18088][ML] Various ChiSqSelector cleanups
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?
Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #15647 from jkbradley/chisqselector-follow-ups.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- | docs/ml-features.md | 12 |
1 files changed, 6 insertions, 6 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 64c6a16023..352887d3ba 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1338,14 +1338,14 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`:

-* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some threshold.
+* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
+* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

-By default, the selection method is `KBest`, the default number of top features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
+By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
+The user can choose a selection method using `setSelectorType`.

 **Examples**
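
To make the three renamed selection methods concrete, here is a minimal plain-Python sketch of the selection rules as the updated doc text describes them. It is not Spark's implementation and does not call the Spark API: the function `select_features`, its parameter names, and the sample p-values are hypothetical, and the `percentile` and `fpr` defaults as well as the truncation of the fractional feature count are assumptions made for illustration. Only the `numTopFeatures` default of 50 comes from the doc text.

```python
from typing import List


def select_features(
    p_values: List[float],
    selector_type: str = "numTopFeatures",
    num_top_features: int = 50,   # default stated in the doc text
    percentile: float = 0.1,      # assumed value, for illustration only
    fpr: float = 0.05,            # assumed value, for illustration only
) -> List[int]:
    """Return the sorted indices of the features kept by the given rule.

    p_values[i] is taken to be the chi-squared test p-value of feature i;
    a smaller p-value is treated as a "better" (more predictive) feature.
    """
    # Rank feature indices from smallest to largest p-value.
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        # Keep a fixed number of top-ranked features.
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        # Keep a fraction of all features (count truncated; an assumption).
        chosen = ranked[: int(len(p_values) * percentile)]
    elif selector_type == "fpr":
        # Keep every feature whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < fpr]
    else:
        raise ValueError(f"unknown selector type: {selector_type}")

    return sorted(chosen)


# Hypothetical p-values for five features.
p_values = [0.30, 0.01, 0.04, 0.80, 0.002]

top2 = select_features(p_values, "numTopFeatures", num_top_features=2)
half = select_features(p_values, "percentile", percentile=0.5)
below = select_features(p_values, "fpr", fpr=0.05)
print(top2, half, below)
```

With these sample values, `numTopFeatures` and `percentile` both keep a count decided in advance, whereas `fpr` keeps however many features pass the threshold, so its output size depends on the data.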