author | Matei Zaharia <matei@databricks.com> | 2014-08-18 10:05:52 -0700 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2014-08-18 10:05:52 -0700 |
commit | 6a13dca12fac06f3af892ffcc8922cc84f91b786 | |
tree | ba7935890d765436e3e17cf247ad21069141e499 | |
parent | 7ae28d1247e4756219016206c51fec1656e3917b | |
[SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
Author: Matei Zaharia <matei@databricks.com>
Closes #1990 from mateiz/spark-3084 and squashes the following commits:
f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
-rw-r--r-- | sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala
index c86811e838..481bb8c05e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala
@@ -424,7 +424,7 @@ case class BroadcastHashJoin(
     UnspecifiedDistribution :: UnspecifiedDistribution :: Nil
 
   @transient
-  lazy val broadcastFuture = future {
+  val broadcastFuture = future {
     sparkContext.broadcast(buildPlan.executeCollect())
   }
 
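The effect described in the commit message can be reproduced outside Spark. The sketch below is not Spark code: `LazyOp`, `EagerOp`, and the latch-based bookkeeping are hypothetical stand-ins for the operator and its `broadcastFuture`, and it uses `Future { ... }` from `scala.concurrent` in place of the lowercase `future { ... }` call in the diff. It only illustrates that a future held in a `lazy val` does not start until the field is first read, whereas one held in a `val` starts when the object is constructed.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerVsLazyFuture {
  // Hypothetical stand-in for the operator before the change:
  // the future body does not run until `result` is first accessed.
  class LazyOp(started: CountDownLatch) {
    lazy val result: Future[Int] = Future {
      started.countDown() // signals that the background work has begun
      42
    }
  }

  // Hypothetical stand-in for the operator after the change:
  // the future is submitted to the execution context during construction.
  class EagerOp(started: CountDownLatch) {
    val result: Future[Int] = Future {
      started.countDown()
      42
    }
  }

  def main(args: Array[String]): Unit = {
    val lazyStarted = new CountDownLatch(1)
    val eagerStarted = new CountDownLatch(1)

    // Construct both operators without touching `result` on either.
    val lazyOp = new LazyOp(lazyStarted)
    val eagerOp = new EagerOp(eagerStarted)

    // Wait (with a timeout) to see whether each future body has begun.
    val eagerRan = eagerStarted.await(5, TimeUnit.SECONDS)
    val lazyRan = lazyStarted.await(200, TimeUnit.MILLISECONDS)
    println(s"eager future started at construction: $eagerRan")
    println(s"lazy future started before first access: $lazyRan")

    // Reading the lazy field is what finally starts its future.
    val lazyValue = Await.result(lazyOp.result, 5.seconds)
    val eagerValue = Await.result(eagerOp.result, 5.seconds)
    println(s"values after access: lazy=$lazyValue eager=$eagerValue")
  }
}
```

With several such operators in one plan, the `lazy val` form means each background task starts only when its owner is first asked for results, so the tasks tend to run one after another. The `val` form lets all of them start as soon as the plan is built.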