author    Michael Allman <michael@videoamp.com>  2016-12-06 11:33:35 +0800
committer Wenchen Fan <wenchen@databricks.com>   2016-12-06 11:33:35 +0800
commit    772ddbeaa6fe5abf189d01246f57d295f9346fa3
tree      2d4255f4677f58c4e2a94a40257dd56709fa0e27 /sql/hive/src/test/scala/org
parent    4af142f55771affa5fc7f2abbbf5e47766194e6e
[SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog`
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
## What changes were proposed in this pull request?
Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
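For illustration, here is a rough sketch (not code from this patch) contrasting the two code paths at the `ExternalCatalog` level. It assumes a Hive-enabled `SparkSession` and the partitioned table `default.src_part` used in the test below, and it reaches through `sharedState.externalCatalog`, an internal/unstable API, purely for demonstration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-names-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Internal, unstable API -- used here only to illustrate the difference.
val catalog = spark.sharedState.externalCatalog

// Old path: materialize a full CatalogTablePartition (location, storage
// format, parameters, ...) for every partition, then derive names from each
// partition's spec. The real name construction also escapes special
// characters; this is simplified.
val namesViaMetadata = catalog.listPartitions("default", "src_part").map { p =>
  p.spec.map { case (k, v) => s"$k=$v" }.mkString("/")
}

// New path (this PR): ask the metastore only for the names, which maps to
// the much cheaper Hive getPartitionNames call.
val namesDirect = catalog.listPartitionNames("default", "src_part")
```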
To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions: one with ~17,800 partitions and the other with ~95,000 partitions. For the purposes of this PR, I'll call the former `table1` and the latter `table2`. I ran 5 trials for each table with Spark builds from before and after this PR. The results are as follows:
Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds:
7.901
3.983
4.018
4.331
4.261
Spark at bdc8153, `SHOW PARTITIONS table2`
(Timed out after 10 minutes with a `SocketTimeoutException`.)
Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
3.801
0.449
0.395
0.348
0.336
Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
5.184
1.63
1.474
1.519
1.41
Taking the best time from each set of trials, we get a 12x performance improvement for a table with ~17,800 partitions (3.983 s down to 0.336 s) and at least a 426x improvement for a table with ~95,000 partitions (the 10-minute timeout, 600 s, divided by the best post-patch time of 1.41 s). More significantly, the latter command doesn't even complete with the current code in master.
This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It has made all the difference in the practical usability of our largest tables. Even for tables with about 1,000 partitions, there's a performance improvement of about 2-3x.
## How was this patch tested?
I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
Author: Michael Allman <michael@videoamp.com>
Closes #15998 from mallman/spark-18572-list_partition_names.
Diffstat (limited to 'sql/hive/src/test/scala/org')
-rw-r--r--  sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala | 5 +++++
1 file changed, 5 insertions(+), 0 deletions(-)
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
index 16ae345de6..79e76b3134 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala
@@ -254,6 +254,11 @@ class VersionsSuite extends SparkFunSuite with Logging {
       "default", "src_part", partitions, ignoreIfExists = true)
   }
 
+  test(s"$version: getPartitionNames(catalogTable)") {
+    val partitionNames = (1 to testPartitionCount).map(key2 => s"key1=1/key2=$key2")
+    assert(partitionNames == client.getPartitionNames(client.getTable("default", "src_part")))
+  }
+
   test(s"$version: getPartitions(catalogTable)") {
     assert(testPartitionCount ==
       client.getPartitions(client.getTable("default", "src_part")).size)
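For a usage-level sanity check, here is a minimal sketch of exercising the fast path end to end. It assumes a configured Hive metastore, and the table name `table1` is illustrative only (it is not part of this patch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("show-partitions-check")
  .enableHiveSupport()
  .getOrCreate()

// With this patch, SHOW PARTITIONS is answered from partition names alone
// rather than full partition metadata, so this returns quickly even for
// tables with tens of thousands of partitions.
spark.sql("SHOW PARTITIONS table1").show(truncate = false)
```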