author: Reynold Xin <rxin@databricks.com> 2015-09-23 16:43:21 -0700
committer: Reynold Xin <rxin@databricks.com> 2015-09-23 16:43:21 -0700
commit: 9952217749118ae78fe794ca11e1c4a87a4ae8ba
tree: cf71cc84eb34acdeade45cc8be3642db4faa8d54 /python
parent: 067afb4e9bb227f159bcbc2aafafce9693303ea9
[SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame.
Python DataFrame.head/take currently require scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack to fix the issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <rxin@databricks.com>

Closes #8876 from rxin/SPARK-10731.
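The cost difference described above can be sketched with a toy model in plain Python. This is not Spark code: partitions are modelled as in-memory lists, and the helper names take_via_full_scan and take_incremental are hypothetical, introduced only for illustration. The real Scala DataFrame.take is more involved; the sketch only shows why stopping early avoids touching every partition.

```python
from typing import List, Sequence, Tuple


def take_via_full_scan(partitions: Sequence[Sequence[int]], num: int) -> Tuple[List[int], int]:
    """Toy model: evaluate every partition, then keep the first `num` rows.

    Returns the rows and the number of partitions that were scanned.
    """
    rows: List[int] = []
    scanned = 0
    for part in partitions:
        scanned += 1
        rows.extend(part)
    return rows[:num], scanned


def take_incremental(partitions: Sequence[Sequence[int]], num: int) -> Tuple[List[int], int]:
    """Toy model: scan partitions one at a time and stop once `num` rows are collected.

    Assumption: one partition per step; this is a simplification, not Spark's strategy.
    """
    rows: List[int] = []
    scanned = 0
    for part in partitions:
        if len(rows) >= num:
            break
        scanned += 1
        # Only take as many rows as are still missing.
        rows.extend(part[: num - len(rows)])
    return rows, scanned


partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(take_via_full_scan(partitions, 3))  # ([1, 2, 3], 4)
print(take_incremental(partitions, 3))    # ([1, 2, 3], 2)
```

Both helpers return the same rows; only the number of partitions touched differs, which is the behaviour this change aims to restore for the Python API.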
Diffstat (limited to 'python')
-rw-r--r-- python/pyspark/sql/dataframe.py | 5
1 file changed, 4 insertions, 1 deletion
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 80f8d8a0eb..b09422aade 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -300,7 +300,10 @@ class DataFrame(object):
         >>> df.take(2)
         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
         """
-        return self.limit(num).collect()
+        with SCCallSiteSync(self._sc) as css:
+            port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
+                self._jdf, num)
+        return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
 
     @ignore_unicode_prefix
     @since(1.3)