From 5a7af9e7ac85e04aa4a420bc2887207bfa18f792 Mon Sep 17 00:00:00 2001
From: Nong Li <nong@databricks.com>
Date: Wed, 24 Feb 2016 17:16:45 -0800
Subject: [SPARK-13250] [SQL] Update PhysicalRDD to convert to UnsafeRow if
 using the vectorized scanner.

Some parts of the engine rely on UnsafeRow, which the vectorized Parquet scanner does not want
to produce. This adds a conversion in PhysicalRDD. In the case where codegen is used (and the
scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch also
updates PhysicalRDD to support codegen, which eliminates the need for the UnsafeRow conversion
in all cases.

With these changes, the query time for TPCDS-Q19 at the 10 GB scale factor drops from
9.5 seconds to 6.5 seconds.

Author: Nong Li <nong@databricks.com>

Closes #11141 from nongli/spark-13250.
---
 python/pyspark/sql/dataframe.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'python/pyspark')

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index bf43452e08..7275e69353 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -173,7 +173,8 @@ class DataFrame(object):
 
         >>> df.explain()
         == Physical Plan ==
-        Scan ExistingRDD[age#0,name#1]
+        WholeStageCodegen
+        :  +- Scan ExistingRDD[age#0,name#1]
 
         >>> df.explain(True)
         == Parsed Logical Plan ==
-- 
cgit v1.2.3