author     Davies Liu <davies.liu@gmail.com>    2014-09-13 16:22:04 -0700
committer  Josh Rosen <joshrosen@apache.org>    2014-09-13 16:22:04 -0700
commit     2aea0da84c58a179917311290083456dfa043db7 (patch)
tree       6cda208e50f24c31883f1fdf2f51b7a6a8399ff1 /python/pyspark/daemon.py
parent     0f8c4edf4e750e3d11da27cc22c40b0489da7f37 (diff)
download   spark-2aea0da84c58a179917311290083456dfa043db7.tar.gz
           spark-2aea0da84c58a179917311290083456dfa043db7.tar.bz2
           spark-2aea0da84c58a179917311290083456dfa043db7.zip
[SPARK-3030] [PySpark] Reuse Python worker
Reuse Python workers to avoid the overhead of fork()ing a Python process for each task. The daemon also tracks the broadcasts held by each worker, so repeated broadcasts are not sent again. This reduces the time for a dummy task from 22ms to 13ms (-40%) and helps reduce latency for Spark Streaming.

For a job with a broadcast variable (43 MB after compression):

```
b = sc.broadcast(set(range(30000000)))
print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
```

it finishes in 281s without worker reuse and in 65s with worker reuse (4 CPUs). Reusing the worker saves about 9 seconds per task that would otherwise be spent transferring and deserializing the broadcast.

Worker reuse is enabled by default and can be disabled with `spark.python.worker.reuse = false` (see the configuration sketch after the commit list below).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2259 from davies/reuse-worker and squashes the following commits:

f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
3939f20 [Davies Liu] fix bug in serializer in mllib
cf1c55e [Davies Liu] address comments
3133a60 [Davies Liu] fix accumulator with reused worker
760ab1f [Davies Liu] do not reuse worker if there are any exceptions
7abb224 [Davies Liu] refactor: sychronized with itself
ac3206e [Davies Liu] renaming
8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
6325fc1 [Davies Liu] bugfix: bid >= 0
e0131a2 [Davies Liu] fix name of config
583716e [Davies Liu] only reuse completed and not interrupted worker
ace2917 [Davies Liu] kill python worker after timeout
6123d0f [Davies Liu] track broadcasts for each worker
8d2f08c [Davies Liu] reuse python worker
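The sketch below shows one way an application might turn worker reuse off through SparkConf. Only the `spark.python.worker.reuse` key comes from this commit; the app name, master, and the small count() job are illustrative assumptions, not part of the change.

```python
# Hedged sketch: disabling Python worker reuse for a PySpark application.
# Only the "spark.python.worker.reuse" key is from the commit above; the
# app name, master, and toy job are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("reuse-demo")                     # illustrative name
        .setMaster("local[4]")                        # illustrative master
        .set("spark.python.worker.reuse", "false"))   # default is "true"
sc = SparkContext(conf=conf)

# Each task now pays the fork() cost again, which is what the benchmark
# in the commit message measures.
print(sc.parallelize(range(24000), 100).count())
```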
Diffstat (limited to 'python/pyspark/daemon.py')
-rw-r--r--  python/pyspark/daemon.py  38
1 file changed, 19 insertions, 19 deletions
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 15445abf67..64d6202acb 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -23,6 +23,7 @@ import socket
import sys
import traceback
import time
+import gc
from errno import EINTR, ECHILD, EAGAIN
from socket import AF_INET, SOCK_STREAM, SOMAXCONN
from signal import SIGHUP, SIGTERM, SIGCHLD, SIG_DFL, SIG_IGN
@@ -46,17 +47,6 @@ def worker(sock):
    signal.signal(SIGCHLD, SIG_DFL)
    signal.signal(SIGTERM, SIG_DFL)

-    # Blocks until the socket is closed by draining the input stream
-    # until it raises an exception or returns EOF.
-    def waitSocketClose(sock):
-        try:
-            while True:
-                # Empty string is returned upon EOF (and only then).
-                if sock.recv(4096) == '':
-                    return
-        except:
-            pass
-
    # Read the socket using fdopen instead of socket.makefile() because the latter
    # seems to be very slow; note that we need to dup() the file descriptor because
    # otherwise writes also cause a seek that makes us miss data on the read side.
@@ -64,17 +54,13 @@ def worker(sock):
    outfile = os.fdopen(os.dup(sock.fileno()), "a+", 65536)
    exit_code = 0
    try:
-        # Acknowledge that the fork was successful
-        write_int(os.getpid(), outfile)
-        outfile.flush()
        worker_main(infile, outfile)
    except SystemExit as exc:
-        exit_code = exc.code
+        exit_code = compute_real_exit_code(exc.code)
    finally:
        outfile.flush()
-        # The Scala side will close the socket upon task completion.
-        waitSocketClose(sock)
-        os._exit(compute_real_exit_code(exit_code))
+        if exit_code:
+            os._exit(exit_code)
# Cleanup zombie children
@@ -111,6 +97,8 @@ def manager():
    signal.signal(SIGTERM, handle_sigterm)  # Gracefully exit on SIGTERM
    signal.signal(SIGHUP, SIG_IGN)  # Don't die on SIGHUP

+    reuse = os.environ.get("SPARK_REUSE_WORKER")
+
    # Initialization complete
    try:
        while True:
@@ -163,7 +151,19 @@ def manager():
                    # in child process
                    listen_sock.close()
                    try:
-                        worker(sock)
+                        # Acknowledge that the fork was successful
+                        outfile = sock.makefile("w")
+                        write_int(os.getpid(), outfile)
+                        outfile.flush()
+                        outfile.close()
+                        while True:
+                            worker(sock)
+                            if not reuse:
+                                # wait for closing
+                                while sock.recv(1024):
+                                    pass
+                                break
+                            gc.collect()
                    except:
                        traceback.print_exc()
                        os._exit(1)
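As a rough standalone illustration of the pattern in the last hunk, the sketch below mimics the child-side loop: acknowledge the fork by sending the pid, keep serving tasks on the same socket while reuse is on, and run a gc between tasks. It is not Spark's daemon; serve_one_task and its echo protocol are illustrative assumptions.

```python
# Hedged sketch of the worker-reuse loop from this commit, using a toy
# echo protocol instead of Spark's worker_main. All helper names are
# illustrative, not Spark APIs.
import gc
import os
import struct


def serve_one_task(sock):
    # Stand-in for worker_main: echo a single request.
    # Returns False once the peer has closed the connection.
    data = sock.recv(4096)
    if not data:
        return False
    sock.sendall(data)
    return True


def child_loop(sock, reuse):
    # Acknowledge that the fork was successful by sending our pid,
    # mirroring write_int(os.getpid(), outfile) in the real daemon.
    sock.sendall(struct.pack("!i", os.getpid()))
    while True:
        alive = serve_one_task(sock)
        if not reuse or not alive:
            # Without reuse, wait for the other side to close the socket.
            while sock.recv(1024):
                pass
            break
        gc.collect()  # keep a long-lived worker's memory in check between tasks
    sock.close()
```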