author     Josh Rosen <joshrosen@eecs.berkeley.edu>  2012-12-24 17:20:10 -0800
committer  Josh Rosen <joshrosen@eecs.berkeley.edu>  2012-12-24 17:20:10 -0800
commit     4608902fb87af64a15b97ab21fe6382cd6e5a644 (patch)
tree       ef2e9e7a9d3f88dccd70e7b6e39753354b605207 /pyspark/pyspark/serializers.py
parent     ccd075cf960df6c6c449b709515cdd81499a52be (diff)
Use filesystem to collect RDDs in PySpark.

Passing large volumes of data through Py4J seems to be slow. It appears
to be faster to write the data to the local filesystem and read it back
from Python.
Diffstat (limited to 'pyspark/pyspark/serializers.py')

-rw-r--r--  pyspark/pyspark/serializers.py | 8

1 file changed, 8 insertions, 0 deletions
diff --git a/pyspark/pyspark/serializers.py b/pyspark/pyspark/serializers.py
index 21ef8b106c..bfcdda8f12 100644
--- a/pyspark/pyspark/serializers.py
+++ b/pyspark/pyspark/serializers.py
@@ -33,3 +33,11 @@ def read_with_length(stream):
     if obj == "":
         raise EOFError
     return obj
+
+
+def read_from_pickle_file(stream):
+    try:
+        while True:
+            yield load_pickle(read_with_length(stream))
+    except EOFError:
+        return
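The generator added above consumes a stream of length-prefixed pickled records, which is what lets the JVM side dump an RDD to a local file that Python then reads back without pushing the data through Py4J. Below is a minimal, self-contained sketch of that framing. The `write_with_length` and `read_with_length` helpers here are illustrative stand-ins (the real `read_with_length` and `load_pickle` live elsewhere in `serializers.py`), and the in-memory `BytesIO` stands in for the temporary file a real run would use:

```python
import io
import pickle
import struct


def write_with_length(obj, stream):
    # Pickle the object and prefix it with a 4-byte big-endian length.
    data = pickle.dumps(obj)
    stream.write(struct.pack("!i", len(data)))
    stream.write(data)


def read_with_length(stream):
    # Read one length-prefixed record; signal end of stream with EOFError.
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError
    length = struct.unpack("!i", header)[0]
    data = stream.read(length)
    if len(data) < length:
        raise EOFError
    return data


def read_from_pickle_file(stream):
    # Yield unpickled records until the stream is exhausted,
    # mirroring the generator added in this commit.
    try:
        while True:
            yield pickle.loads(read_with_length(stream))
    except EOFError:
        return


# Round-trip: "collect" a few records through a stream. A real collect
# would write to a file on the local filesystem rather than BytesIO.
buf = io.BytesIO()
for record in [1, "two", {"three": 3}]:
    write_with_length(record, buf)
buf.seek(0)
collected = list(read_from_pickle_file(buf))
```

Framing each record with its length is what lets the reader stop cleanly: a short read on the 4-byte header means the writer finished, and the `except EOFError` turns that into normal generator termination instead of a propagated exception.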