path: root/docs/mllib-feature-extraction.md
author     Joseph K. Bradley <joseph@databricks.com>    2015-02-23 16:15:57 -0800
committer  Xiangrui Meng <meng@databricks.com>          2015-02-23 16:15:57 -0800
commit     59536cc87e10e5011560556729dd901280958f43 (patch)
tree       5b3340929bc18e849dc31b514895ed3557084102 /docs/mllib-feature-extraction.md
parent     28ccf5ee769a1df019e38985112065c01724fbd9 (diff)
[SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  Do the changes look OK?

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
Diffstat (limited to 'docs/mllib-feature-extraction.md')
-rw-r--r--  docs/mllib-feature-extraction.md  72
1 file changed, 60 insertions, 12 deletions
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index d588b9cb46..80842b27ef 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -377,27 +377,27 @@ data2 = labels.zip(normalizer2.transform(features))
</div>
## Feature selection
-[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set.
### ChiSqSelector
-ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which are most closely related to the label.
#### Model Fitting
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
following parameters in the constructor:
-* `numTopFeatures` number of top features that selector will select (filter).
+* `numTopFeatures` number of top features that the selector will select (filter).
We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
-return a model which can transform the input dataset into the reduced feature space.
+return a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
-Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
+Note that the user can also construct a `ChiSqSelectorModel` by hand by providing an array of selected feature indices (which must be sorted in ascending order).
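For instance, a minimal sketch of constructing the model directly (the feature indices 1, 5 and 9 below are arbitrary placeholders, not values derived from any dataset):

{% highlight scala %}
import org.apache.spark.mllib.feature.ChiSqSelectorModel
import org.apache.spark.mllib.linalg.Vectors

// Placeholder indices; they must be sorted in ascending order.
val manualModel = new ChiSqSelectorModel(Array(1, 5, 9))

// Keep only components 1, 5 and 9 of a vector.
val reduced = manualModel.transform(Vectors.dense(Array.tabulate(10)(_.toDouble)))
// reduced now holds the values 1.0, 5.0 and 9.0
{% endhighlight %}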
#### Example
@@ -411,21 +411,69 @@ import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
-// load some data in libsvm format, each point is in the range 0..255
+// Load some data in libsvm format
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
-// discretize data in 16 equal bins
+// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
val discretizedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => (x / 16).floor }))
}
-// create ChiSqSelector that will select 50 features
+// Create ChiSqSelector that will select 50 features
val selector = new ChiSqSelector(50)
-// create ChiSqSelector model
-val transformer = selector.fit(disctetizedData)
-// filter top 50 features from each feature vector
-val filteredData = disctetizedData.map { lp =>
+// Create ChiSqSelector model (selecting features)
+val transformer = selector.fit(discretizedData)
+// Filter the top 50 features from each feature vector
+val filteredData = discretizedData.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}
{% endhighlight %}
</div>
+
+<div data-lang="java">
+{% highlight java %}
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.feature.ChiSqSelector;
+import org.apache.spark.mllib.feature.ChiSqSelectorModel;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaChiSqSelector");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+JavaRDD<LabeledPoint> points = MLUtils.loadLibSVMFile(sc.sc(),
+ "data/mllib/sample_libsvm_data.txt").toJavaRDD().cache();
+
+// Discretize data in 16 equal bins since ChiSqSelector requires categorical features
+JavaRDD<LabeledPoint> discretizedData = points.map(
+    new Function<LabeledPoint, LabeledPoint>() {
+      @Override
+      public LabeledPoint call(LabeledPoint lp) {
+        final double[] discretizedFeatures = new double[lp.features().size()];
+        for (int i = 0; i < lp.features().size(); ++i) {
+          // Bin each feature value into one of 16 categories.
+          discretizedFeatures[i] = Math.floor(lp.features().apply(i) / 16);
+        }
+        return new LabeledPoint(lp.label(), Vectors.dense(discretizedFeatures));
+      }
+    });
+
+// Create ChiSqSelector that will select 50 features
+ChiSqSelector selector = new ChiSqSelector(50);
+// Create ChiSqSelector model (selecting features)
+final ChiSqSelectorModel transformer = selector.fit(discretizedData.rdd());
+// Filter the top 50 features from each feature vector
+JavaRDD<LabeledPoint> filteredData = discretizedData.map(
+    new Function<LabeledPoint, LabeledPoint>() {
+      @Override
+      public LabeledPoint call(LabeledPoint lp) {
+        return new LabeledPoint(lp.label(), transformer.transform(lp.features()));
+      }
+    }
+);
+
+sc.stop();
+{% endhighlight %}
+</div>
</div>
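As noted above, the fitted model implements `VectorTransformer`, so it can also be applied to a whole `RDD[Vector]` in one call. A minimal sketch, reusing `transformer` and `discretizedData` from the Scala example:

{% highlight scala %}
// Strip off the labels to get an RDD[Vector].
val featureVectors = discretizedData.map(_.features)
// Reduce every vector in the RDD to the selected 50 features.
val reducedVectors = transformer.transform(featureVectors)
{% endhighlight %}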