diff options
author | Michael Armbrust <michael@databricks.com> | 2014-04-07 00:14:00 -0700 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2014-04-07 00:14:00 -0700 |
commit | accd0999f9cb6a449434d3fc5274dd469eeecab2 (patch) | |
tree | 6c1a73644c3af4aee9a60872bc58d0a1a2bc531a /sql/hive/src/main | |
parent | 87d0928a3301835705652c24a26096546597e156 (diff) | |
download | spark-accd0999f9cb6a449434d3fc5274dd469eeecab2.tar.gz spark-accd0999f9cb6a449434d3fc5274dd469eeecab2.tar.bz2 spark-accd0999f9cb6a449434d3fc5274dd469eeecab2.zip |
[SQL] SPARK-1371 Hash Aggregation Improvements
Given:
```scala
case class Data(a: Int, b: Int)
val rdd =
sparkContext
.parallelize(1 to 200)
.flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000]
16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a
7536.436ms
SELECT SUM(b) FROM data
10954.1ms
```
After:
```
SELECT COUNT(*) FROM data:[10000000]
1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a
2070.446ms
SELECT SUM(b) FROM data
958.969ms
```
Author: Michael Armbrust <michael@databricks.com>
Closes #295 from marmbrus/hashAgg and squashes the following commits:
ec63575 [Michael Armbrust] Add comment.
d0495a9 [Michael Armbrust] Use scaladoc instead.
b4a6887 [Michael Armbrust] Address review comments.
a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections. Remove unused local RDD functions implicits.
5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
1279df2 [Michael Armbrust] Increase default number of partitions.
Diffstat (limited to 'sql/hive/src/main')
-rw-r--r-- | sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala index 44901db3f9..2c607455c8 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala @@ -337,13 +337,16 @@ case class HiveGenericUdaf( type UDFType = AbstractGenericUDAFResolver + @transient protected lazy val resolver: AbstractGenericUDAFResolver = createFunction(name) + @transient protected lazy val objectInspector = { resolver.getEvaluator(children.map(_.dataType.toTypeInfo).toArray) .init(GenericUDAFEvaluator.Mode.COMPLETE, inspectors.toArray) } + @transient protected lazy val inspectors = children.map(_.dataType).map(toInspector) def dataType: DataType = inspectorToDataType(objectInspector) |