author | Winston Chen <wchen@quid.com> | 2015-01-28 11:08:44 -0800 |
---|---|---|
committer | Josh Rosen <joshrosen@databricks.com> | 2015-01-28 11:08:44 -0800 |
commit | 453d7999b88be87bda30d9e73038eb484ee063bd (patch) | |
tree | a84e81c187a29c818ec0d384868e80f01f805e48 /python | |
parent | 0b35fcd7f01044e86669bac93e9663277c86365b (diff) | |
download | spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.gz spark-453d7999b88be87bda30d9e73038eb484ee063bd.tar.bz2 spark-453d7999b88be87bda30d9e73038eb484ee063bd.zip |
[SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly
This was found by reading an RDD with `sc.newAPIHadoopRDD` and writing it back with `rdd.saveAsNewAPIHadoopFile` in PySpark.
It turns out that whenever an RDD goes through more than one conversion from JavaRDD to PythonRDD and back to JavaRDD, the exception below is thrown:
```
15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
```
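The cast failure above has a simple shape: the JVM-side deserializer assumed every unpickled pair arrived as a `java.util.ArrayList`, while Pyrolite unpickles Python tuples as `Object[]`, so pairs that had already made one round trip through Python no longer matched. The following is an illustrative sketch of that pattern in plain Python only; the real code is the Scala in `SerDeUtil.pythonToJava`, and the two function names here are hypothetical:

```python
def pair_assuming_list(obj):
    # Mirrors the old behavior: an unconditional cast to one concrete type.
    if not isinstance(obj, list):
        raise TypeError("%r cannot be cast to list" % (obj,))
    return obj[0], obj[1]

def pair_tolerant(obj):
    # Mirrors the idea of the fix: accept any two-element sequence,
    # whether it arrives tuple-shaped or list-shaped.
    key, value = obj
    return key, value

# A freshly parallelized pair may arrive list-shaped, but after a
# JavaRDD -> PythonRDD -> JavaRDD round trip it comes back tuple-shaped,
# which is exactly what the unconditional cast tripped over.
print(pair_tolerant((u'2', {u'director': u'David Lean'})))
```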
The test case code below reproduces it:
```python
from pyspark.rdd import RDD
dl = [
(u'2', {u'director': u'David Lean'}),
(u'7', {u'director': u'Andrew Dominik'})
]
dl_rdd = sc.parallelize(dl)
tmp = dl_rdd._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()
tmp = t._to_java_object_rdd()
tmp2 = sc._jvm.SerDe.javaToPython(tmp)
t = RDD(tmp2, sc)
t.count()  # blows up here, on the second round of conversion
```
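The repro triggers the bug because `_to_java_object_rdd` pickles each element and the JVM unpickles it with Pyrolite, which maps Python lists to `java.util.ArrayList` but Python tuples to `Object[]`. A quick sanity check (pure Python, using only the standard `pickle` module) that the tuple type survives a serialization round trip, which is why the second conversion sees tuple-shaped data:

```python
import pickle

# The pair keeps its tuple type through a pickle round trip, so the
# JVM-side unpickler sees a tuple-shaped object, not a list.
pair = (u'2', {u'director': u'David Lean'})
restored = pickle.loads(pickle.dumps(pair))
print(type(restored).__name__)  # -> tuple
```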
Author: Winston Chen <wchen@quid.com>
Closes #4146 from wingchen/master and squashes the following commits:
903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR
5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks
126be6b [Winston Chen] SPARK-5361, add in test case
4cf1187 [Winston Chen] SPARK-5361, add in test case
9f1a097 [Winston Chen] add in tuple handling while converting from python RDD back to JavaRDD
Diffstat (limited to 'python')
-rw-r--r-- | python/pyspark/tests.py | 19 |
1 file changed, 19 insertions(+), 0 deletions(-)
```
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index e8e207af46..e694ffcff5 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -714,6 +714,25 @@ class RDDTests(ReusedPySparkTestCase):
         wr_s21 = rdd.sample(True, 0.4, 21).collect()
         self.assertNotEqual(set(wr_s11), set(wr_s21))
 
+    def test_multiple_python_java_RDD_conversions(self):
+        # Regression test for SPARK-5361
+        data = [
+            (u'1', {u'director': u'David Lean'}),
+            (u'2', {u'director': u'Andrew Dominik'})
+        ]
+        from pyspark.rdd import RDD
+        data_rdd = self.sc.parallelize(data)
+        data_java_rdd = data_rdd._to_java_object_rdd()
+        data_python_rdd = self.sc._jvm.SerDe.javaToPython(data_java_rdd)
+        converted_rdd = RDD(data_python_rdd, self.sc)
+        self.assertEqual(2, converted_rdd.count())
+
+        # conversion between python and java RDD threw exceptions
+        data_java_rdd = converted_rdd._to_java_object_rdd()
+        data_python_rdd = self.sc._jvm.SerDe.javaToPython(data_java_rdd)
+        converted_rdd = RDD(data_python_rdd, self.sc)
+        self.assertEqual(2, converted_rdd.count())
+
 
 class ProfilerTests(PySparkTestCase):
```