author | Davies Liu <davies@databricks.com> | 2014-11-24 17:17:03 -0800 |
---|---|---|
committer | Josh Rosen <joshrosen@databricks.com> | 2014-11-24 17:17:03 -0800 |
commit | 6cf507685efd01df77d663145ae08e48c7f92948 (patch) | |
tree | bdca89f0ce6e0304e93a605a697adbfec4c6f737 /python/pyspark/worker.py | |
parent | 050616b408c60eae02256913ceb645912dbff62e (diff) | |
download | spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.gz spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.bz2 spark-6cf507685efd01df77d663145ae08e48c7f92948.zip |
[SPARK-4548] [SPARK-4517] improve performance of python broadcast
Re-implement the Python broadcast using files:
1) Serialize the Python object using cPickle and write it to disk.
2) Create a wrapper in the JVM for the dumped file; it reads the data from the file during serialization.
3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the executors.
4) During deserialization, write the data to disk.
5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object, deferring this until the first access.
This fixes the performance regression introduced in #2659. Performance is similar to 1.1, but objects larger than 2G are now supported and memory efficiency is improved (only one compressed copy in the driver and executor).
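The Python-side part of the steps above (dump once to a file, hand over only the path, unpickle on first access) can be sketched as follows. This is a minimal illustration, not the code from this patch: the class name `FileBackedBroadcast` and its attributes are hypothetical, it uses Python 3's `pickle` where the commit message mentions cPickle, and the JVM transfer in steps 2 to 4 is left out entirely. Pausing the garbage collector during unpickling mirrors the squashed commit "disable gc while unpickle".

```python
import gc
import os
import pickle
import tempfile


class FileBackedBroadcast:
    """Hypothetical stand-in for a broadcast value stored in a pickle file."""

    def __init__(self, value=None, path=None):
        if path is None:
            # "Driver" side: serialize the value to a temporary file once.
            fd, path = tempfile.mkstemp(suffix=".pkl")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
            self._value = value
            self._loaded = True
        else:
            # "Worker" side: keep only the path and defer loading.
            self._value = None
            self._loaded = False
        self.path = path

    @property
    def value(self):
        if not self._loaded:
            # Pause the garbage collector while unpickling, then restore
            # its previous state even if loading fails.
            was_enabled = gc.isenabled()
            gc.disable()
            try:
                with open(self.path, "rb") as f:
                    self._value = pickle.load(f)
            finally:
                if was_enabled:
                    gc.enable()
            self._loaded = True
        return self._value


# Usage: the "worker" receives just the path and loads lazily.
driver_side = FileBackedBroadcast(value={"a": 1, "b": [1, 2, 3]})
worker_side = FileBackedBroadcast(path=driver_side.path)
print(worker_side.value["b"])
os.remove(driver_side.path)
```

Because only a path crosses the process boundary, a task that never touches the value pays no deserialization cost in this sketch.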
Testing with a 500M broadcast and 4 tasks (excluding the benefit from worker reuse in 1.2):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13%
python-broadcast-w-set | 4.13 | 4.50 | -8.35%
Testing with 100 tasks (16 CPUs):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%
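The "improvement" column is not defined in the message. It appears consistent with `(time in 1.1 / time with patch - 1) * 100`, which is an assumption inferred from the numbers rather than something the commit states. The sketch below recomputes the column from the rounded timings in the two tables. The results differ from the reported values by a fraction of a percentage point, presumably because the reported figures came from unrounded measurements.

```python
def improvement_percent(before, after):
    """Relative speedup in percent; assumed definition of the 'improvement' column."""
    return (before / after - 1.0) * 100.0


# (benchmark, 1.1 timing, 1.2-with-patch timing, reported improvement %)
rows = [
    ("python-broadcast-w-bytes (4 tasks)", 25.20, 9.33, 170.13),
    ("python-broadcast-w-set (4 tasks)", 4.13, 4.50, -8.35),
    ("python-broadcast-w-bytes (100 tasks)", 38.16, 8.40, 353.98),
    ("python-broadcast-w-set (100 tasks)", 23.29, 9.59, 142.80),
]

for name, before, after, reported in rows:
    computed = improvement_percent(before, after)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```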
Author: Davies Liu <davies@databricks.com>
Closes #3417 from davies/pybroadcast and squashes the following commits:
50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
Diffstat (limited to 'python/pyspark/worker.py')
-rw-r--r-- | python/pyspark/worker.py | 10 |
1 files changed, 3 insertions, 7 deletions
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index e1552a0b0b..7e5343c973 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -30,8 +30,7 @@ from pyspark.accumulators import _accumulatorRegistry
 from pyspark.broadcast import Broadcast, _broadcastRegistry
 from pyspark.files import SparkFiles
 from pyspark.serializers import write_with_length, write_int, read_long, \
-    write_long, read_int, SpecialLengths, UTF8Deserializer, PickleSerializer, \
-    SizeLimitedStream, LargeObjectSerializer
+    write_long, read_int, SpecialLengths, UTF8Deserializer, PickleSerializer
 from pyspark import shuffle
 
 pickleSer = PickleSerializer()
@@ -78,14 +77,11 @@ def main(infile, outfile):
 
         # fetch names and values of broadcast variables
         num_broadcast_variables = read_int(infile)
-        bser = LargeObjectSerializer()
         for _ in range(num_broadcast_variables):
             bid = read_long(infile)
             if bid >= 0:
-                size = read_long(infile)
-                s = SizeLimitedStream(infile, size)
-                value = list((bser.load_stream(s)))[0]  # read out all the bytes
-                _broadcastRegistry[bid] = Broadcast(bid, value)
+                path = utf8_deserializer.loads(infile)
+                _broadcastRegistry[bid] = Broadcast(path=path)
             else:
                 bid = - bid - 1
                 _broadcastRegistry.pop(bid)
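The loop in the hunk above reads a small command stream: a count, then one signed ID per entry, where a non-negative ID is followed by a file path to register and a negative ID encodes removal of broadcast `-bid - 1` (the offset lets ID 0 be removed too). The sketch below replays that logic on an in-memory stream. It is a simplified illustration, not PySpark code: the helpers are local stand-ins for `read_int`, `read_long` and `UTF8Deserializer`, the big-endian, length-prefixed framing is an assumption, and the registry stores plain path strings instead of `Broadcast` objects.

```python
import io
import struct


def read_int(stream):
    # Stand-in helper: 4-byte big-endian signed integer (assumed framing).
    return struct.unpack("!i", stream.read(4))[0]


def read_long(stream):
    # Stand-in helper: 8-byte big-endian signed integer (assumed framing).
    return struct.unpack("!q", stream.read(8))[0]


def read_utf8(stream):
    # Stand-in helper: length-prefixed UTF-8 string (assumed framing).
    length = read_int(stream)
    return stream.read(length).decode("utf-8")


def update_registry(stream, registry):
    """Apply the add/remove commands in `stream` to `registry` (id -> path)."""
    for _ in range(read_int(stream)):
        bid = read_long(stream)
        if bid >= 0:
            # Register: only the path is sent, not the value itself.
            registry[bid] = read_utf8(stream)
        else:
            # Remove: the sender encoded the ID as -(id + 1).
            registry.pop(-bid - 1)
    return registry


# Usage: register broadcast 3 and remove broadcast 7 (sent as -8).
path = "/tmp/b3".encode("utf-8")
payload = (
    struct.pack("!i", 2)
    + struct.pack("!q", 3) + struct.pack("!i", len(path)) + path
    + struct.pack("!q", -8)
)
registry = update_registry(io.BytesIO(payload), {7: "/tmp/b7"})
print(registry)
```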