[SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame

## What changes were proposed in this pull request? RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer). This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds. The JDBC server has been updated to use DataFrame.toIterator. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12114 from davies/local_iterator.
author: Davies Liu <davies@databricks.com> 2016-04-04 13:31:44 -0700
committer: Davies Liu <davies.liu@gmail.com> 2016-04-04 13:31:44 -0700
commit: cc70f174169f45c85d459126a68bbe43c0bec328 (patch)
tree: a455d9fab09fb11f6d78265243a1a301a6e5563b /sql/hive-thriftserver
parent: 7143904700435265975d36f073cce2833467e121 (diff)
download: spark-cc70f174169f45c85d459126a68bbe43c0bec328.tar.gz
spark-cc70f174169f45c85d459126a68bbe43c0bec328.tar.bz2
spark-cc70f174169f45c85d459126a68bbe43c0bec328.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
index a955314ba3..673a293ce2 100644
--- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
+++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
@@ -222,7 +222,7 @@ private[hive] class SparkExecuteStatementOperation(
         val useIncrementalCollect =
           hiveContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
         if (useIncrementalCollect) {
-          result.rdd.toLocalIterator
+          result.toLocalIterator.asScala
         } else {
           result.collect().iterator
         }
author	Davies Liu <davies@databricks.com>	2016-04-04 13:31:44 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-04-04 13:31:44 -0700
commit	cc70f174169f45c85d459126a68bbe43c0bec328 (patch)
tree	a455d9fab09fb11f6d78265243a1a301a6e5563b /sql/hive-thriftserver
parent	7143904700435265975d36f073cce2833467e121 (diff)
download	spark-cc70f174169f45c85d459126a68bbe43c0bec328.tar.gz spark-cc70f174169f45c85d459126a68bbe43c0bec328.tar.bz2 spark-cc70f174169f45c85d459126a68bbe43c0bec328.zip