author    Yin Huai <yhuai@databricks.com>  2015-05-20 11:23:40 -0700
committer Yin Huai <yhuai@databricks.com>  2015-05-20 11:23:49 -0700
commit    55bd1bb52e54f710264e6517bb42b74672dd71fb (patch)
tree      349aed0bf794cfcca003fa17e4d926183c6ac69b /mllib/src/main
parent    606ae3e10e76325c032860ad7be1da94921af44a (diff)
[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan.
https://issues.apache.org/jira/browse/SPARK-7713
I tested the performance with the following code:
```scala
import sqlContext._
import sqlContext.implicits._

// Write 5000 partition directories, each holding 1000 rows.
(1 to 5000).foreach { i =>
  (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

// Register the partitioned directory tree as a temporary Parquet table.
sqlContext.sql("""
  CREATE TEMPORARY TABLE partitionedParquet
  USING org.apache.spark.sql.parquet
  OPTIONS (
    path '/tmp/partitioned'
  )""")

table("partitionedParquet").explain(true)
```
On master, `explain` takes 40s on my laptop; with this PR, it takes 14s.
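The speedup comes from serializing the Hadoop `Configuration` once via a broadcast variable shared by all partition scans, instead of shipping a fresh copy per partition. A minimal sketch of the pattern, assuming a Spark 1.x context (the `SerializableWritable` wrapper and variable names here are illustrative, not the exact code from this commit):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local"))

// Configuration is not Serializable, so wrap it before broadcasting.
val hadoopConf = new Configuration()
val broadcastedConf = sc.broadcast(new SerializableWritable(hadoopConf))

// Each task reads the shared copy instead of deserializing its own:
sc.parallelize(1 to 4).foreach { _ =>
  val conf: Configuration = broadcastedConf.value.value
  // ... open readers against `conf` for this partition's files ...
}
```

With thousands of partitions, broadcasting once amortizes the cost of serializing the (fairly large) Hadoop configuration, which is what shrinks the planning time in the benchmark above.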
Author: Yin Huai <yhuai@databricks.com>
Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:
6fa73df [Yin Huai] Address comments of Josh and Andrew.
807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
e393555 [Yin Huai] Cheng's comments.
2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
(cherry picked from commit b631bf73b9f288f37c98b806be430b22485880e5)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Diffstat (limited to 'mllib/src/main')
0 files changed, 0 insertions, 0 deletions