[SPARK-3047] [PySpark] add an option to use str in textFileRDD

str is much efficient than unicode (both CPU and memory), it'e better to use str in textFileRDD. In order to keep compatibility, use unicode by default. (Maybe change it in the future). use_unicode=True: daviesliudm:~/work/spark$ time python wc.py (u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776) real 2m8.298s user 0m0.185s sys 0m0.064s use_unicode=False daviesliudm:~/work/spark$ time python wc.py ('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776) real 1m26.402s user 0m0.182s sys 0m0.062s We can see that it got 32% improvement! Author: Davies Liu <davies.liu@gmail.com> Closes #1951 from davies/unicode and squashes the following commits: 8352d57 [Davies Liu] update version number a286f2f [Davies Liu] rollback loads() 85246e5 [Davies Liu] add docs for use_unicode a0295e1 [Davies Liu] add an option to use str in textFile()
author: Davies Liu <davies.liu@gmail.com> 2014-09-11 11:50:36 -0700
committer: Josh Rosen <joshrosen@apache.org> 2014-09-11 11:50:36 -0700
commit: 1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb (patch)
tree: f0480c59cad1ab80cf3e03edd6d2d423c6e037b3 /python/pyspark/serializers.py
parent: ed1980ffa9ccb87d76694ba910ef22df034bca49 (diff)
download: spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.tar.gz
spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.tar.bz2
spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.zip
1 files changed, 11 insertions, 7 deletions
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 55e6cf3308..7b2710b913 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -429,18 +429,22 @@ class UTF8Deserializer(Serializer):
     Deserializes streams written by String.getBytes.
     """
 
+    def __init__(self, use_unicode=False):
+        self.use_unicode = use_unicode
+
     def loads(self, stream):
         length = read_int(stream)
-        return stream.read(length).decode('utf8')
+        s = stream.read(length)
+        return s.decode("utf-8") if self.use_unicode else s
 
     def load_stream(self, stream):
-        while True:
-            try:
+        try:
+            while True:
                 yield self.loads(stream)
-            except struct.error:
-                return
-            except EOFError:
-                return
+        except struct.error:
+            return
+        except EOFError:
+            return
 
 
 def read_long(stream):
author	Davies Liu <davies.liu@gmail.com>	2014-09-11 11:50:36 -0700
committer	Josh Rosen <joshrosen@apache.org>	2014-09-11 11:50:36 -0700
commit	1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb (patch)
tree	f0480c59cad1ab80cf3e03edd6d2d423c6e037b3 /python/pyspark/serializers.py
parent	ed1980ffa9ccb87d76694ba910ef22df034bca49 (diff)
download	spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.tar.gz spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.tar.bz2 spark-1ef656ea85b4b93c7b0f3cf8042b63a0de0901cb.zip