[SPARK-5361]Multiple Java RDD <-> Python RDD conversions not working correctly

This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark. It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` The test case code below reproduces it: ``` from pyspark.rdd import RDD dl = [ (u'2', {u'director': u'David Lean'}), (u'7', {u'director': u'Andrew Dominik'}) ] dl_rdd = sc.parallelize(dl) tmp = dl_rdd._to_java_object_rdd() tmp2 = sc._jvm.SerDe.javaToPython(tmp) t = RDD(tmp2, sc) t.count() tmp = t._to_java_object_rdd() tmp2 = sc._jvm.SerDe.javaToPython(tmp) t = RDD(tmp2, sc) t.count() # it blows up here during the 2nd time of conversion ``` Author: Winston Chen <wchen@quid.com> Closes #4146 from wingchen/master and squashes the following commits: 903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR 5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks 126be6b [Winston Chen] SPARK-5361, add in test case 4cf1187 [Winston Chen] SPARK-5361, add in test case 9f1a097 [Winston Chen] add in tuple handling while converting form python RDD back to JavaRDD
author: Winston Chen <wchen@quid.com> 2015-01-28 11:08:44 -0800
committer: Josh Rosen <joshrosen@databricks.com> 2015-01-28 11:08:44 -0800
commit: 453d7999b88be87bda30d9e73038eb484ee063bd (patch)
tree: a84e81c187a29c818ec0d384868e80f01f805e48 /core/src
parent: 0b35fcd7f01044e86669bac93e9663277c86365b (diff)
download: spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.gz
spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.bz2
spark-453d7999b88be87bda30d9e73038eb484ee063bd.zip
2 files changed, 9 insertions, 1 deletions
diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala b/core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala
index 5ba66178e2..c9181a29d4 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala
@@ -138,6 +138,11 @@ private[python] class JavaToWritableConverter extends Converter[Any, Writable] {
           mapWritable.put(convertToWritable(k), convertToWritable(v))
         }
         mapWritable
+      case array: Array[Any] => {
+        val arrayWriteable = new ArrayWritable(classOf[Writable])
+        arrayWriteable.set(array.map(convertToWritable(_)))
+        arrayWriteable
+      }
       case other => throw new SparkException(
         s"Data of type ${other.getClass.getName} cannot be used")
     }
diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
index a4153aaa92..19ca2bb613 100644
--- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
@@ -153,7 +153,10 @@ private[spark] object SerDeUtil extends Logging {
       iter.flatMap { row =>
         val obj = unpickle.loads(row)
         if (batched) {
-          obj.asInstanceOf[JArrayList[_]].asScala
+          obj match {
+            case array: Array[Any] => array.toSeq
+            case _ => obj.asInstanceOf[JArrayList[_]].asScala
+          }
         } else {
           Seq(obj)
         }
author	Winston Chen <wchen@quid.com>	2015-01-28 11:08:44 -0800
committer	Josh Rosen <joshrosen@databricks.com>	2015-01-28 11:08:44 -0800
commit	453d7999b88be87bda30d9e73038eb484ee063bd (patch)
tree	a84e81c187a29c818ec0d384868e80f01f805e48 /core/src
parent	0b35fcd7f01044e86669bac93e9663277c86365b (diff)
download	spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.gz spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.bz2 spark-453d7999b88be87bda30d9e73038eb484ee063bd.zip