author    Yanbo Liang <ybliang8@gmail.com>  2016-09-29 00:54:26 -0700
committer Yanbo Liang <ybliang8@gmail.com>  2016-09-29 00:54:26 -0700
commit    a19a1bb59411177caaf99581e89098826b7d0c7b
tree      649a504d904cce2f0783def6e0114ab68a9e1024
parent    37eb9184f1e9f1c07142c66936671f4711ef407d
[SPARK-16356][FOLLOW-UP][ML] Enforce ML test of exception for local/distributed Dataset.
## What changes were proposed in this pull request?

#14035 added ```testImplicits``` to ML unit tests and promoted ```toDF()```, but left one minor issue in ```VectorIndexerSuite```: creating the DataFrame via ```Seq(...).toDF()``` makes one of the test cases throw a different exception than ```sc.parallelize(Seq(...)).toDF()``` does. After an in-depth study, I found this is caused by differing behavior between local and distributed Datasets when a UDF fails at an ```assert```: a local Dataset throws the ```AssertionError``` directly, while a distributed Dataset throws a ```SparkException``` that wraps the ```AssertionError```. This test should be enforced to cover both cases.

## How was this patch tested?

Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15261 from yanboliang/spark-16356.
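The local-versus-distributed distinction the patch tests can be reproduced outside ```VectorIndexerSuite```. Below is a minimal, hypothetical ScalaTest-style sketch of the behavior the commit message describes; the UDF, the column name, and the ```spark``` session setup are illustrative assumptions, not part of this patch:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.functions.udf

// Assumes a ScalaTest suite with a SparkSession `spark` in scope
// (e.g. via MLlibTestSparkContext) so that `intercept` and the
// implicits below are available.
import spark.implicits._

// Hypothetical UDF that fails an assert for every input row.
val failingUdf = udf { (i: Int) => assert(i < 0, "expected negative"); i }

val local = Seq(1, 2, 3).toDF("x")

// Local Dataset: the AssertionError from the UDF surfaces directly.
intercept[AssertionError] {
  local.select(failingUdf($"x")).collect()
}

// Distributed Dataset: the task failure is rethrown by the driver
// wrapped in a SparkException.
intercept[SparkException] {
  local.repartition(2).select(failingUdf($"x")).collect()
}
```

Repartitioning the same local DataFrame is enough to force execution through distributed tasks, which is exactly the trick the patch uses with ```densePoints2.repartition(2)``` below.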
-rw-r--r--  mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
index 4da1b133e8..b28ce2ab45 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala
@@ -88,9 +88,7 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
     densePoints1 = densePoints1Seq.map(FeatureData).toDF()
     sparsePoints1 = sparsePoints1Seq.map(FeatureData).toDF()
-    // TODO: If we directly use `toDF` without parallelize, the test in
-    // "Throws error when given RDDs with different size vectors" is failed for an unknown reason.
-    densePoints2 = sc.parallelize(densePoints2Seq, 2).map(FeatureData).toDF()
+    densePoints2 = densePoints2Seq.map(FeatureData).toDF()
     sparsePoints2 = sparsePoints2Seq.map(FeatureData).toDF()
     badPoints = badPointsSeq.map(FeatureData).toDF()
   }
@@ -121,10 +119,17 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
     model.transform(densePoints1) // should work
     model.transform(sparsePoints1) // should work
-    intercept[SparkException] {
+    // If the data is local Dataset, it throws AssertionError directly.
+    intercept[AssertionError] {
       model.transform(densePoints2).collect()
       logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
     }
+    // If the data is distributed Dataset, it throws SparkException
+    // which is the wrapper of AssertionError.
+    intercept[SparkException] {
+      model.transform(densePoints2.repartition(2)).collect()
+      logInfo("Did not throw error when fit, transform were called on vectors of different lengths")
+    }
     intercept[SparkException] {
       vectorIndexer.fit(badPoints)
       logInfo("Did not throw error when fitting vectors of different lengths in same RDD.")