[SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param

## What changes were proposed in this pull request?

CountVectorizerModel already has a binary toggle param. This PR adds the same binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param to CountVectorizer directly, I moved the binary param to CountVectorizerParams, so the estimator inherits it.

## How was this patch tested?

Added a new test case that fits the model with the binary flag set to true and then checks that all non-zero counts in the trained model's output are set to 1.0. All tests in CountVectorizerSuite.scala pass.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12200 from wangmiao1981/binary_param.
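For readers unfamiliar with the binary toggle, the following is a minimal sketch of the behavior described above: term counts are computed against a vocabulary, and when the flag is set every non-zero count is clamped to 1.0. It uses plain Scala collections only; the object name `BinaryCountSketch` and the `vectorize` helper are hypothetical and are not part of Spark or of this patch.

```scala
// Illustrative only: plain Scala, not Spark's CountVectorizer implementation.
// `BinaryCountSketch` and `vectorize` are hypothetical names used for this sketch.
object BinaryCountSketch {

  /** Counts occurrences of each vocabulary term in `tokens`.
    * When `binary` is true, every non-zero count is replaced by 1.0.
    * Tokens that are not in the vocabulary are ignored.
    */
  def vectorize(
      tokens: Seq[String],
      vocabulary: Array[String],
      binary: Boolean): Array[Double] = {
    // Map each vocabulary term to its position in the output vector.
    val index: Map[String, Int] = vocabulary.zipWithIndex.toMap
    val counts = Array.fill(vocabulary.length)(0.0)
    tokens.foreach { token =>
      index.get(token).foreach(i => counts(i) += 1.0)
    }
    if (binary) counts.map(c => if (c > 0.0) 1.0 else 0.0) else counts
  }
}
```

Under these assumptions, `vectorize("a a a a b b b b c d".split(" ").toSeq, Array("a", "b", "c", "d"), binary = true)` yields 1.0 in every position, while the same call with `binary = false` keeps the raw counts.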
author: wm624@hotmail.com <wm624@hotmail.com> 2016-04-09 09:57:07 +0200
committer: Nick Pentreath <nick.pentreath@gmail.com> 2016-04-09 09:57:07 +0200
commit: a9b8b655b25f4ed519037faaf7601a3d9842547f (patch)
tree: 3a50f1327b9869b61859db401f72c30fbd14e0d9 /mllib/src/test
parent: 90c0a04506a4972b7a2ac2b7dda0c5f8509a6e2f (diff)
download: spark-a9b8b655b25f4ed519037faaf7601a3d9842547f.tar.gz
spark-a9b8b655b25f4ed519037faaf7601a3d9842547f.tar.bz2
spark-a9b8b655b25f4ed519037faaf7601a3d9842547f.zip
1 files changed, 16 insertions, 3 deletions
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
index 04f165c5f1..ff0de06e27 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
@@ -168,21 +168,34 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
     }
   }
 
-  test("CountVectorizerModel with binary") {
+  test("CountVectorizerModel and CountVectorizer with binary") {
     val df = sqlContext.createDataFrame(Seq(
-      (0, split("a a a b b c"), Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+      (0, split("a a a a b b b b c d"),
+      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)))),
       (1, split("c c c"), Vectors.sparse(4, Seq((2, 1.0)))),
       (2, split("a"), Vectors.sparse(4, Seq((0, 1.0))))
     )).toDF("id", "words", "expected")
 
-    val cv = new CountVectorizerModel(Array("a", "b", "c", "d"))
+    // CountVectorizer test
+    val cv = new CountVectorizer()
       .setInputCol("words")
       .setOutputCol("features")
       .setBinary(true)
+      .fit(df)
     cv.transform(df).select("features", "expected").collect().foreach {
       case Row(features: Vector, expected: Vector) =>
         assert(features ~== expected absTol 1e-14)
     }
+
+    // CountVectorizerModel test
+    val cv2 = new CountVectorizerModel(cv.vocabulary)
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setBinary(true)
+    cv2.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
+    }
   }
 
   test("CountVectorizer read/write") {