author | wm624@hotmail.com <wm624@hotmail.com> | 2016-04-09 09:57:07 +0200 |
---|---|---|
committer | Nick Pentreath <nick.pentreath@gmail.com> | 2016-04-09 09:57:07 +0200 |
commit | a9b8b655b25f4ed519037faaf7601a3d9842547f (patch) | |
tree | 3a50f1327b9869b61859db401f72c30fbd14e0d9 /mllib/src/test | |
parent | 90c0a04506a4972b7a2ac2b7dda0c5f8509a6e2f (diff) | |
[SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param
## What changes were proposed in this pull request?
CountVectorizerModel already has a binary toggle param. This PR adds the same binary toggle param to the CountVectorizer estimator. As discussed in the JIRA, instead of adding a param to CountVectorizer directly, I moved the binary param to CountVectorizerParams, so the estimator inherits it.
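The inheritance described above can be illustrated with a minimal, Spark-free sketch. Every name in it (`VectorizerParams`, `SketchEstimator`, `SketchModel`) is hypothetical and chosen only to mirror the roles of CountVectorizerParams, CountVectorizer and CountVectorizerModel; it is not Spark's implementation, and the way the flag is handed from estimator to model here is an assumption made for illustration.

```scala
// Hypothetical sketch: these are NOT Spark's classes, only stand-ins for
// the shared-params pattern the PR describes.

// Shared params trait: anything that mixes it in gets the binary toggle.
trait VectorizerParams {
  private var binaryFlag: Boolean = false
  def getBinary: Boolean = binaryFlag
  def setBinary(value: Boolean): this.type = { binaryFlag = value; this }
}

// Stand-in for the model: counts vocabulary terms in a tokenized document.
class SketchModel(val vocabulary: Array[String]) extends VectorizerParams {
  def transform(tokens: Seq[String]): Array[Double] = {
    val counts = Array.fill(vocabulary.length)(0.0)
    tokens.foreach { t =>
      val i = vocabulary.indexOf(t)
      if (i >= 0) counts(i) += 1.0 // tokens outside the vocabulary are ignored
    }
    // In binary mode every non-zero count is clamped to 1.0.
    if (getBinary) counts.map(c => if (c > 0.0) 1.0 else 0.0) else counts
  }
}

// Stand-in for the estimator: it inherits setBinary from the shared trait
// and passes the flag on to the model it produces (assumed behavior).
class SketchEstimator extends VectorizerParams {
  def fit(docs: Seq[Seq[String]]): SketchModel = {
    // Simplified vocabulary: distinct tokens in alphabetical order.
    val vocab = docs.flatten.distinct.sorted.toArray
    new SketchModel(vocab).setBinary(getBinary)
  }
}
```

With a vocabulary of `a`, `b`, `c`, transforming the tokens `a a a c` in this sketch gives counts 3.0, 0.0, 1.0 by default and 1.0, 0.0, 1.0 once `setBinary(true)` has been called on the estimator or the model.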
## How was this patch tested?
Added a new test case that fits the model with the binary flag set to true and then checks that all non-zero counts produced by the trained model are set to 1.0.
All tests in CountVectorizerSuite.scala pass.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12200 from wangmiao1981/binary_param.
Diffstat (limited to 'mllib/src/test')
-rw-r--r-- | mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala | 19 |
1 files changed, 16 insertions, 3 deletions
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
index 04f165c5f1..ff0de06e27 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala
@@ -168,21 +168,34 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
     }
   }
 
-  test("CountVectorizerModel with binary") {
+  test("CountVectorizerModel and CountVectorizer with binary") {
     val df = sqlContext.createDataFrame(Seq(
-      (0, split("a a a b b c"), Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+      (0, split("a a a a b b b b c d"),
+      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)))),
       (1, split("c c c"), Vectors.sparse(4, Seq((2, 1.0)))),
       (2, split("a"), Vectors.sparse(4, Seq((0, 1.0))))
     )).toDF("id", "words", "expected")
 
-    val cv = new CountVectorizerModel(Array("a", "b", "c", "d"))
+    // CountVectorizer test
+    val cv = new CountVectorizer()
       .setInputCol("words")
       .setOutputCol("features")
       .setBinary(true)
+      .fit(df)
     cv.transform(df).select("features", "expected").collect().foreach {
       case Row(features: Vector, expected: Vector) =>
         assert(features ~== expected absTol 1e-14)
     }
+
+    // CountVectorizerModel test
+    val cv2 = new CountVectorizerModel(cv.vocabulary)
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setBinary(true)
+    cv2.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
+    }
   }
 
   test("CountVectorizer read/write") {