author     Josh Rosen <joshrosen@databricks.com>    2015-01-23 17:53:15 -0800
committer  Josh Rosen <joshrosen@databricks.com>    2015-01-23 17:53:15 -0800
commit     cef1f092a628ac20709857b4388bb10e0b5143b0 (patch)
tree       1a383501a69bdd6335b8bb8db9ca88352b8e6846 /python
parent     ea74365b7c5a3ac29cae9ba66f140f1fa5e8d312 (diff)
[SPARK-5063] More helpful error messages for several invalid operations
This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.

In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).

In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits:

a38774b [Josh Rosen] Fix spelling typo
a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases.
2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility
3f0ea0c [Josh Rosen] Revert two unintentional formatting changes
8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2
8cff41a [Josh Rosen] IllegalStateException fix
6ef68d0 [Josh Rosen] Fix Python line length issues.
9f6a0b8 [Josh Rosen] Add improved error messages to PySpark.
13afd0f [Josh Rosen] SparkException -> IllegalStateException
8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063
b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD
99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts.
34833e8 [Josh Rosen] Add more descriptive error message.
57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD.
15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
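For illustration, here is a minimal PySpark sketch of two of the invalid patterns this patch targets (the app name and local-mode setup are assumptions for the demo, not part of the commit); the action-inside-a-transformation case is shown after the rdd.py hunk below:

    from pyspark import SparkContext

    sc = SparkContext("local", "spark-5063-demo")  # assumed local-mode setup
    rdd = sc.parallelize([1, 2, 3])

    # Invalid: broadcasting an RDD. An RDD is a driver-side handle, not data,
    # so attempting to pickle it for the broadcast should now fail fast with
    # the SPARK-5063 message instead of a confusing NullPointerException.
    # sc.broadcast(rdd)

    # Invalid: calling methods on a stopped SparkContext.
    sc.stop()
    # sc.parallelize([4, 5, 6])  # now reports that the context is stopped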
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/context.py   8
-rw-r--r--  python/pyspark/rdd.py      11
2 files changed, 19 insertions, 0 deletions
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 64f6a3ca6b..568e21f380 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -229,6 +229,14 @@ class SparkContext(object):
         else:
             SparkContext._active_spark_context = instance
 
+    def __getnewargs__(self):
+        # This method is called when attempting to pickle SparkContext, which is always an error:
+        raise Exception(
+            "It appears that you are attempting to reference SparkContext from a broadcast "
+            "variable, action, or transformation. SparkContext can only be used on the driver, "
+            "not in code that runs on workers. For more information, see SPARK-5063."
+        )
+
     def __enter__(self):
         """
         Enable 'with SparkContext(...) as sc: app(sc)' syntax.
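The `__getnewargs__` hook works because PySpark pickles every object captured by a closure before shipping it to the workers, and pickle (protocol 2 and above) consults `__getnewargs__` while serializing an instance; raising from it aborts serialization on the driver with a descriptive message. A standalone sketch of that mechanism, with an illustrative class and message that are not from the patch:

    import pickle

    class DriverOnly(object):
        def __getnewargs__(self):
            # pickle invokes this hook while serializing the instance, so
            # raising here converts a confusing downstream failure into an
            # immediate, descriptive error on the driver.
            raise Exception("DriverOnly objects cannot be pickled")

    pickle.dumps(DriverOnly(), protocol=2)  # raises our Exception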
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 4977400ac1..f4cfe4845d 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -141,6 +141,17 @@ class RDD(object):
     def __repr__(self):
         return self._jrdd.toString()
 
+    def __getnewargs__(self):
+        # This method is called when attempting to pickle an RDD, which is always an error:
+        raise Exception(
+            "It appears that you are attempting to broadcast an RDD or reference an RDD from an "
+            "action or transformation. RDD transformations and actions can only be invoked by "
+            "the driver, not inside of other transformations; for example, "
+            "rdd1.map(lambda x: rdd2.values().count() * x) is invalid because the values() "
+            "transformation and count() action cannot be performed inside of the rdd1.map() "
+            "transformation. For more information, see SPARK-5063."
+        )
+
     @property
     def context(self):
         """