[SPARK-16283][SQL] Implements percentile_approx aggregation function which supports partial aggregation.

## What changes were proposed in this pull request? This PR implements aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`. ### Syntax: ``` # Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory. percentile_approx(col, percentage [, accuracy]) # Returns percentile value array at given percentage value array percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) ``` ### Features: 1. This function supports partial aggregation. 2. The memory consumption is bounded. The larger `accuracy` parameter we choose, we smaller error we get. The default accuracy value is 10000, to match with Hive default setting. Choose a smaller value for smaller memory footprint. 3. This function supports window function aggregation. ### Example usages: ``` ## Returns the 25th percentile value, with default accuracy SELECT percentile_approx(col, 0.25) FROM table ## Returns an array of percentile value (25th, 50th, 75th), with default accuracy SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table ## Returns 25th percentile value, with custom accuracy value 100, larger accuracy parameter yields smaller approximation error SELECT percentile_approx(col, 0.25, 100) FROM table ## Returns the 25th, and 50th percentile values, with custom accuracy value 100 SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table ``` ### NOTE: 1. The `percentile_approx` implementation is different from Hive, so the result returned on same query maybe slightly different with Hive. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev. (http://dx.doi.org/10.1145/375663.375670)` 2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal. ## How was this patch tested? Unit test, and Sql query test. ## Acknowledgement 1. This PR's work in based on lw-lin's PR https://github.com/apache/spark/pull/14298, with improvements like supporting partial aggregation, fixing out of memory issue. Author: Sean Zhong <seanzhong@databricks.com> Closes #14868 from clockfly/appro_percentile_try_2.
author: Sean Zhong <seanzhong@databricks.com> 2016-09-01 16:31:13 +0800
committer: Wenchen Fan <wenchen@databricks.com> 2016-09-01 16:31:13 +0800
commit: a18c169fd050e71fdb07b153ae0fa5c410d8de27 (patch)
tree: be693af0f087329fd457ad0c238ffb2a3a9f8bab /sql/hive/src
parent: 536fa911c181958d84f14156f7d57ef5fd68df48 (diff)
download: spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.tar.gz
spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.tar.bz2
spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.zip
2 files changed, 6 insertions, 2 deletions
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
index bfa5899faf..85c509847d 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
@@ -227,7 +227,6 @@ private[sql] class HiveSessionCatalog(
   private val hiveFunctions = Seq(
     "hash",
     "histogram_numeric",
-    "percentile",
-    "percentile_approx"
+    "percentile"
   )
 }
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/ExpressionToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/ExpressionToSQLSuite.scala
index b4eb50e331..fdd02821df 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/ExpressionToSQLSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/ExpressionToSQLSuite.scala
@@ -155,6 +155,11 @@ class ExpressionToSQLSuite extends SQLBuilderTest with SQLTestUtils {
 
   test("aggregate functions") {
     checkSqlGeneration("SELECT approx_count_distinct(value) FROM t1 GROUP BY key")
+    checkSqlGeneration("SELECT percentile_approx(value, 0.25) FROM t1 GROUP BY key")
+    checkSqlGeneration("SELECT percentile_approx(value, array(0.25, 0.75)) FROM t1 GROUP BY key")
+    checkSqlGeneration("SELECT percentile_approx(value, 0.25, 100) FROM t1 GROUP BY key")
+    checkSqlGeneration(
+      "SELECT percentile_approx(value, array(0.25, 0.75), 100) FROM t1 GROUP BY key")
     checkSqlGeneration("SELECT avg(value) FROM t1 GROUP BY key")
     checkSqlGeneration("SELECT corr(value, key) FROM t1 GROUP BY key")
     checkSqlGeneration("SELECT count(value) FROM t1 GROUP BY key")
author	Sean Zhong <seanzhong@databricks.com>	2016-09-01 16:31:13 +0800
committer	Wenchen Fan <wenchen@databricks.com>	2016-09-01 16:31:13 +0800
commit	a18c169fd050e71fdb07b153ae0fa5c410d8de27 (patch)
tree	be693af0f087329fd457ad0c238ffb2a3a9f8bab /sql/hive/src
parent	536fa911c181958d84f14156f7d57ef5fd68df48 (diff)
download	spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.tar.gz spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.tar.bz2 spark-a18c169fd050e71fdb07b153ae0fa5c410d8de27.zip