author: Xusen Yin <yinxusen@gmail.com> 2015-12-01 15:21:53 -0800
committer: Joseph K. Bradley <joseph@databricks.com> 2015-12-01 15:21:53 -0800
commit: e76431f886ae8061545b3216e8e2fb38c4db1f43
tree: 80c46e668d82144dd0b25eeaff25ca124d514ca7 /docs
parent: 328b757d5d4486ea3c2e246780792d7a57ee85e5
[SPARK-11961][DOC] Add docs of ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11961

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9965 from yinxusen/SPARK-11961.
Diffstat (limited to 'docs')
-rw-r--r-- docs/ml-features.md | 50
1 files changed, 50 insertions, 0 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index cd1838d6d2..5f88877555 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1949,3 +1949,53 @@ output.select("features", "label").show()
{% endhighlight %}
</div>
</div>
+
+## ChiSqSelector
+
+`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
+categorical features. `ChiSqSelector` ranks each feature by a
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
+against the class label, and then keeps the top features on which the class label depends the
+most. This is akin to selecting the features with the most predictive power.
+
+**Examples**
+
+Assume that we have a DataFrame with the columns `id`, `features`, and `clicked`, where `clicked`
+is the target to be predicted:
+
+~~~
+id | features              | clicked
+---|-----------------------|---------
+ 7 | [0.0, 0.0, 18.0, 1.0] | 1.0
+ 8 | [0.0, 1.0, 12.0, 0.0] | 0.0
+ 9 | [1.0, 0.0, 15.0, 0.1] | 0.0
+~~~
+
+If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
+last column in our `features` is chosen as the most useful feature:
+
+~~~
+id | features              | clicked | selectedFeatures
+---|-----------------------|---------|------------------
+ 7 | [0.0, 0.0, 18.0, 1.0] | 1.0     | [1.0]
+ 8 | [0.0, 1.0, 12.0, 0.0] | 0.0     | [0.0]
+ 9 | [1.0, 0.0, 15.0, 0.1] | 0.0     | [0.1]
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [ChiSqSelector Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ChiSqSelector)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [ChiSqSelector Java docs](api/java/org/apache/spark/ml/feature/ChiSqSelector.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaChiSqSelectorExample.java %}
+</div>
+</div>
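
The ranking step described in the `ChiSqSelector` section of this patch can be illustrated without Spark. The following is a minimal sketch, not Spark's implementation: it computes a plain Pearson chi-squared statistic for each feature column against the label, treating every distinct value as a category. The class name `ChiSqRankingSketch` and its helper methods are hypothetical names introduced only for this illustration; the data is the three-row table from the docs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and this is not Spark's code.
class ChiSqRankingSketch {

    // Pearson chi-squared statistic of one categorical feature against the label:
    // sum over all (value, label) cells of (observed - expected)^2 / expected.
    static double chiSquared(double[] feature, double[] label) {
        int n = feature.length;
        Map<Double, Integer> featureCounts = new LinkedHashMap<>();
        Map<Double, Integer> labelCounts = new LinkedHashMap<>();
        Map<String, Integer> jointCounts = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            featureCounts.merge(feature[i], 1, Integer::sum);
            labelCounts.merge(label[i], 1, Integer::sum);
            jointCounts.merge(feature[i] + "|" + label[i], 1, Integer::sum);
        }
        double statistic = 0.0;
        for (Map.Entry<Double, Integer> f : featureCounts.entrySet()) {
            for (Map.Entry<Double, Integer> l : labelCounts.entrySet()) {
                // Expected count under independence: rowTotal * columnTotal / n.
                double expected = (double) f.getValue() * l.getValue() / n;
                int observed = jointCounts.getOrDefault(f.getKey() + "|" + l.getKey(), 0);
                statistic += (observed - expected) * (observed - expected) / expected;
            }
        }
        return statistic;
    }

    // One statistic per feature column of a row-major feature matrix.
    static double[] scores(double[][] rows, double[] label) {
        int numFeatures = rows[0].length;
        double[] result = new double[numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            double[] column = new double[rows.length];
            for (int i = 0; i < rows.length; i++) {
                column[i] = rows[i][j];
            }
            result[j] = chiSquared(column, label);
        }
        return result;
    }

    public static void main(String[] args) {
        // The example table from the docs: features and the `clicked` label.
        double[][] features = {
            {0.0, 0.0, 18.0, 1.0},
            {0.0, 1.0, 12.0, 0.0},
            {1.0, 0.0, 15.0, 0.1}
        };
        double[] clicked = {1.0, 0.0, 0.0};
        double[] s = scores(features, clicked);
        for (int j = 0; j < s.length; j++) {
            System.out.printf("feature %d: chi-squared = %.4f%n", j, s[j]);
        }
    }
}
```

On this tiny table the last two columns score higher than the first two, which is consistent with a late column being selected when `numTopFeatures = 1`. How ties between equally scored columns are broken is up to the actual implementation and is not modeled here.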