author | Alexander Ulanov <nashb@yandex.ru> | 2015-02-02 12:13:05 -0800 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-02-02 12:13:05 -0800 |
commit | c081b21b1fe4fbad845088c4144da0bd2a8d89dc (patch) | |
tree | c509dfa59591bf5ec56cf26a48ab8a62e6df4a51 /mllib/src/test | |
parent | 6f341310bf1fa59a28c96d123fa59e12b9366b68 (diff) | |
[MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection
The following is implemented:
1) generic traits for feature selection and filtering
2) trait for feature selection of LabeledPoint with discrete data
3) traits for calculation of contingency table and chi squared
4) class for chi-squared feature selection
5) tests for the above
Needs some optimization in matrix operations.
This request is an attempt to implement feature selection for MLlib; the previous work by the issue author, izendejas, was not finished (https://issues.apache.org/jira/browse/SPARK-1473). It is also related to the data discretization issues https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216, which were not merged.
Author: Alexander Ulanov <nashb@yandex.ru>
Closes #1484 from avulanov/featureselection and squashes the following commits:
755d358 [Alexander Ulanov] Addressing reviewers comments @mengxr
a6ad82a [Alexander Ulanov] Addressing reviewers comments @mengxr
714b878 [Alexander Ulanov] Addressing reviewers comments @mengxr
010acff [Alexander Ulanov] Rebase
427ca4e [Alexander Ulanov] Addressing reviewers comments: implement VectorTransformer interface, use Statistics.chiSqTest
f9b070a [Alexander Ulanov] Adding Apache header in tests...
80363ca [Alexander Ulanov] Tests, comments, apache headers and scala style
150a3e0 [Alexander Ulanov] Scala style fix
f356365 [Alexander Ulanov] Chi Squared by contingency table. Refactoring
2bacdc7 [Alexander Ulanov] Combinations and chi-squared values test
66e0333 [Alexander Ulanov] Feature selector, fix of lazyness
aab9b73 [Alexander Ulanov] Feature selection redesign with vigdorchik
e24eee4 [Alexander Ulanov] Traits for FeatureSelection, CombinationsCalculator and FeatureFilter
ca49e80 [Alexander Ulanov] Feature selection filter
2ade254 [Alexander Ulanov] Code style
0bd8434 [Alexander Ulanov] Chi Squared feature selection: initial version
Diffstat (limited to 'mllib/src/test')
-rw-r--r-- | mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala | 67 |
1 files changed, 67 insertions, 0 deletions
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
new file mode 100644
index 0000000000..747f591459
--- /dev/null
+++ b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class ChiSqSelectorSuite extends FunSuite with MLlibTestSparkContext {
+
+  /*
+   *  Contingency tables
+   *  feature0 = {8.0, 0.0}
+   *  class  0 1 2
+   *    8.0||1|0|1|
+   *    0.0||0|2|0|
+   *
+   *  feature1 = {7.0, 9.0}
+   *  class  0 1 2
+   *    7.0||1|0|0|
+   *    9.0||0|2|1|
+   *
+   *  feature2 = {0.0, 6.0, 8.0, 5.0}
+   *  class  0 1 2
+   *    0.0||1|0|0|
+   *    6.0||0|1|0|
+   *    8.0||0|1|0|
+   *    5.0||0|0|1|
+   *
+   *  Use chi-squared calculator from Internet
+   */
+
+  test("ChiSqSelector transform test (sparse & dense vector)") {
+    val labeledDiscreteData = sc.parallelize(
+      Seq(LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
+        LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))), 2)
+    val preFilteredData =
+      Set(LabeledPoint(0.0, Vectors.dense(Array(0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(6.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(8.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(5.0))))
+    val model = new ChiSqSelector(1).fit(labeledDiscreteData)
+    val filteredData = labeledDiscreteData.map { lp =>
+      LabeledPoint(lp.label, model.transform(lp.features))
+    }.collect().toSet
+    assert(filteredData == preFilteredData)
+  }
+}
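The test comment says the contingency tables were checked with an external chi-squared calculator. As a rough cross-check, the sketch below recomputes the Pearson chi-squared statistic for the three tables in plain Scala, without Spark. The names `ChiSqByHand` and `chiSquared` are hypothetical and not part of this commit, and the code assumes every row and column of a table has a nonzero sum. Worked by hand, the statistics are 4.0 for feature0, 4.0 for feature1, and 8.0 for feature2. That matches the test keeping only feature 2 if the selector ranks features by the raw statistic, which is an inference from the expected output rather than something this diff states.

```scala
// Hypothetical helper for illustration only; not part of the commit.
object ChiSqByHand {

  /** Pearson chi-squared statistic for a contingency table
    * (rows = distinct feature values, columns = class labels).
    * Assumes every row sum and column sum is nonzero.
    */
  def chiSquared(table: Array[Array[Double]]): Double = {
    val rowSums = table.map(_.sum)
    val colSums = table.transpose.map(_.sum)
    val total = rowSums.sum
    val terms = for {
      i <- table.indices
      j <- table(i).indices
    } yield {
      // Expected count under independence of feature value and class.
      val expected = rowSums(i) * colSums(j) / total
      val diff = table(i)(j) - expected
      diff * diff / expected
    }
    terms.sum
  }

  // Contingency tables copied from the comment in ChiSqSelectorSuite.
  val feature0: Array[Array[Double]] = Array(
    Array(1.0, 0.0, 1.0),  // value 8.0
    Array(0.0, 2.0, 0.0))  // value 0.0

  val feature1: Array[Array[Double]] = Array(
    Array(1.0, 0.0, 0.0),  // value 7.0
    Array(0.0, 2.0, 1.0))  // value 9.0

  val feature2: Array[Array[Double]] = Array(
    Array(1.0, 0.0, 0.0),  // value 0.0
    Array(0.0, 1.0, 0.0),  // value 6.0
    Array(0.0, 1.0, 0.0),  // value 8.0
    Array(0.0, 0.0, 1.0))  // value 5.0

  def main(args: Array[String]): Unit = {
    Seq(feature0, feature1, feature2).zipWithIndex.foreach { case (table, i) =>
      println(s"feature$i: chi-squared = ${chiSquared(table)}")
    }
  }
}
```

Note that the tables have different degrees of freedom (2 for feature0 and feature1, 6 for feature2), so the raw statistics are not directly comparable as significance levels; a ranking by p-value could order the features differently.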