diff options
author | Cheng Lian <lian@databricks.com> | 2015-04-08 07:00:56 +0800 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2015-04-08 07:00:56 +0800 |
commit | 77bcceb9f01e97cb6f41791f2167b40c4311f701 (patch) | |
tree | e6071a260685a9f9ea8a6466c30d51f3938e830c /sql | |
parent | fc957dc78138e72036dbbadc9a54f155d318c038 (diff) | |
download | spark-77bcceb9f01e97cb6f41791f2167b40c4311f701.tar.gz spark-77bcceb9f01e97cb6f41791f2167b40c4311f701.tar.bz2 spark-77bcceb9f01e97cb6f41791f2167b40c4311f701.zip |
[SPARK-6748] [SQL] Makes QueryPlan.schema a lazy val
`DataFrame.collect()` calls `SparkPlan.executeCollect()`, which consists of a single line:
```scala
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
```
The problem is that, `QueryPlan.schema` is a function. And since 1.3.0, `convertRowToScala` starts returning a `GenericRowWithSchema`. Thus, every `GenericRowWithSchema` instance holds a separate copy of the schema object. Also, YJP profiling result of the following simple micro benchmark (executed in Spark shell) shows that constructing the schema object takes up to ~35% CPU time.
```scala
sc.parallelize(1 to 10000000).
map(i => (i, s"val_$i")).
toDF("key", "value").
saveAsParquetFile("file:///tmp/src.parquet")
// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
```
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5398)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #5398 from liancheng/spark-6748 and squashes the following commits:
3159469 [Cheng Lian] Makes QueryPlan.schema a lazy val
Diffstat (limited to 'sql')
-rw-r--r-- | sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 02f7c26a8a..7967189cac 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -150,7 +150,7 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy }.toSeq } - def schema: StructType = StructType.fromAttributes(output) + lazy val schema: StructType = StructType.fromAttributes(output) /** Returns the output schema in the tree format. */ def schemaString: String = schema.treeString |