author | gagan taneja <tanejagagan@gagans-MacBook-Pro.local> | 2017-02-07 14:05:22 +0100 |
---|---|---|
committer | Herman van Hovell <hvanhovell@databricks.com> | 2017-02-07 14:05:22 +0100 |
commit | e99e34d0f370211a7c7b96d144cc932b2fc71d10 (patch) | |
tree | 06cd312cf7437f0b221937664ea34c983a0faf3b /R/pkg/inst | |
parent | 3d314d08c9420e74b4bb687603cdd11394eccab5 (diff) | |
[SPARK-19118][SQL] Percentile support for frequency distribution table
## What changes were proposed in this pull request?
I have a frequency distribution table with the following entries:
Age, No. of persons
21, 10
22, 15
23, 18
..
..
30, 14
It is common to have data in frequency distribution format and to need the percentile or median computed from it. With the current implementation, finding the percentile of such data is difficult and complex.
Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency distribution column into consideration.
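
To illustrate the intended semantics, here is a minimal Python sketch of an exact percentile computed directly from (value, frequency) pairs, without expanding the table into one row per person. It is not the Spark implementation: the function name `weighted_percentile`, the linear interpolation between neighbouring ranks, and the three-row sample table (only the first rows listed above) are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def weighted_percentile(freq_table, p):
    """Exact percentile of a frequency distribution (hypothetical helper).

    freq_table: iterable of (value, count) pairs, count a non-negative integer.
    p: percentage in [0, 1].
    The result equals the percentile of the expanded data set, in which
    each value is repeated `count` times.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")

    # Sort by value and drop rows that contribute no observations.
    rows = sorted((v, c) for v, c in freq_table if c > 0)
    if not rows:
        return None

    values = [v for v, _ in rows]
    # cumulative[i] = number of observations with value <= values[i]
    cumulative = list(accumulate(c for _, c in rows))
    total = cumulative[-1]

    # Zero-based fractional rank in the (virtual) expanded, sorted data.
    position = (total - 1) * p
    lower, upper = math.floor(position), math.ceil(position)

    def value_at(index):
        # First row whose cumulative count exceeds the zero-based index.
        return values[bisect_right(cumulative, index)]

    low_value, high_value = value_at(lower), value_at(upper)
    if lower == upper or low_value == high_value:
        return float(low_value)
    # Linear interpolation between the two neighbouring observations.
    return (upper - position) * low_value + (position - lower) * high_value


# Sample rows taken from the table above: (age, number of persons).
ages = [(21, 10), (22, 15), (23, 18)]
median_age = weighted_percentile(ages, 0.5)
```

Working on cumulative counts keeps the cost proportional to the number of distinct values rather than the number of persons, which is the practical benefit of passing a frequency column instead of expanding the data first.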
## How was this patch tested?
1) Enhanced /sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala to cover the additional functionality.
2) Ran a performance benchmark with 20 million rows in a local environment and did not see any performance degradation.
Author: gagan taneja <tanejagagan@gagans-MacBook-Pro.local>
Closes #16497 from tanejagagan/branch-18940.
Diffstat (limited to 'R/pkg/inst')
0 files changed, 0 insertions, 0 deletions