author     Tathagata Das <tathagata.das1565@gmail.com>   2015-08-23 19:24:32 -0700
committer  Tathagata Das <tathagata.das1565@gmail.com>   2015-08-23 19:24:32 -0700
commit     053d94fcf32268369b5a40837271f15d6af41aa4 (patch)
tree       04e0e1d58f4d291c4bd50a7d59444eec0c88f1d5 /python/pyspark/streaming/context.py
parent     b963c19a803c5a26c9b65655d40ca6621acf8bd4 (diff)
[SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts
The current code only checks for checkpoint files in the local filesystem, and it always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:

1. Use the same code path as Java to check whether a valid checkpoint exists.
2. Create a new Python SparkContext only if there is no active one.

There is no test for the non-local path, as it is hard to test distributed filesystem paths in a local unit test. I am going to test it manually with a distributed filesystem to verify that this patch works.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8366 from tdas/SPARK-10142 and squashes the following commits:

3afa666 [Tathagata Das] Added tests
2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
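For context, here is a minimal usage sketch of the StreamingContext.getOrCreate API that this patch touches. The socket source, app name, and checkpoint URI below are hypothetical; the point is that after this change the checkpoint path may be any Hadoop-compatible URI (e.g. on HDFS), not only a local directory, and setupFunc runs only when no valid checkpoint exists.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    def create_context():
        # Called only when no valid checkpoint exists at checkpoint_dir;
        # getOrCreate then calls ssc.checkpoint(checkpoint_dir) on the result.
        sc = SparkContext(conf=SparkConf().setAppName("CheckpointDemo"))
        ssc = StreamingContext(sc, 2)  # 2-second batch interval
        ssc.socketTextStream("localhost", 9999).count().pprint()
        return ssc

    # A non-local (hypothetical) path; before this patch only local
    # filesystem paths were checked, via os.path.exists.
    checkpoint_dir = "hdfs:///tmp/streaming-checkpoint"

    ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
    ssc.start()
    ssc.awaitTermination()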
Diffstat (limited to 'python/pyspark/streaming/context.py')
-rw-r--r--  python/pyspark/streaming/context.py  22
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/python/pyspark/streaming/context.py b/python/pyspark/streaming/context.py
index e3ba70e4e5..4069d7a149 100644
--- a/python/pyspark/streaming/context.py
+++ b/python/pyspark/streaming/context.py
@@ -150,26 +150,30 @@ class StreamingContext(object):
         @param checkpointPath: Checkpoint directory used in an earlier streaming program
         @param setupFunc: Function to create a new context and setup DStreams
         """
-        # TODO: support checkpoint in HDFS
-        if not os.path.exists(checkpointPath) or not os.listdir(checkpointPath):
+        cls._ensure_initialized()
+        gw = SparkContext._gateway
+
+        # Check whether valid checkpoint information exists in the given path
+        if gw.jvm.CheckpointReader.read(checkpointPath).isEmpty():
             ssc = setupFunc()
             ssc.checkpoint(checkpointPath)
             return ssc
 
-        cls._ensure_initialized()
-        gw = SparkContext._gateway
-
         try:
             jssc = gw.jvm.JavaStreamingContext(checkpointPath)
         except Exception:
             print("failed to load StreamingContext from checkpoint", file=sys.stderr)
             raise
 
-        jsc = jssc.sparkContext()
-        conf = SparkConf(_jconf=jsc.getConf())
-        sc = SparkContext(conf=conf, gateway=gw, jsc=jsc)
+        # If there is already an active instance of Python SparkContext use it, or create a new one
+        if not SparkContext._active_spark_context:
+            jsc = jssc.sparkContext()
+            conf = SparkConf(_jconf=jsc.getConf())
+            SparkContext(conf=conf, gateway=gw, jsc=jsc)
+
+        sc = SparkContext._active_spark_context
+
         # update ctx in serializer
-        SparkContext._active_spark_context = sc
         cls._transformerSerializer.ctx = sc
         return StreamingContext(sc, None, jssc)
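
A small sketch of the context-reuse behavior added above, with hypothetical names and paths: if a Python SparkContext is already active when a checkpoint is recovered, getOrCreate now attaches to it instead of constructing a second one (which previously raised an error, since only one Python SparkContext may run at a time).

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="AlreadyActive")  # an active Python SparkContext

    def create_context():
        # Fallback used only when /tmp/ckpt holds no valid checkpoint.
        ssc = StreamingContext(sc, 1)
        ssc.queueStream([sc.parallelize([1, 2, 3])]).pprint()
        return ssc

    ssc = StreamingContext.getOrCreate("/tmp/ckpt", create_context)
    # Whether recovered or newly built, the streaming context is backed by
    # the already-active Python SparkContext rather than a new instance.
    assert ssc.sparkContext is SparkContext._active_spark_context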