author | Davies Liu <davies.liu@gmail.com> | 2014-08-26 13:04:30 -0700 |
---|---|---|
committer | Josh Rosen <joshrosen@apache.org> | 2014-08-26 13:05:35 -0700 |
commit | 83d273023b03faa0ceacd69956a132f40d247bc1 (patch) | |
tree | 45cb9e37d367109998397defdf4fd5afa9a9cc54 /sql | |
parent | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f (diff) | |
download | spark-83d273023b03faa0ceacd69956a132f40d247bc1.tar.gz spark-83d273023b03faa0ceacd69956a132f40d247bc1.tar.bz2 spark-83d273023b03faa0ceacd69956a132f40d247bc1.zip |
[SPARK-2871] [PySpark] add histogram() API
RDD.histogram(buckets)
Compute a histogram using the provided buckets. The buckets
are all open to the right except for the last which is closed.
e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
and 50 we would have a histogram of 1,0,1.
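The bucket rule above can be sketched in plain Python, without Spark. This is only an illustration of the semantics, not the code in this patch; `bucket_counts` is a hypothetical helper name, and dropping out-of-range values is an assumption made for the sketch.

```python
from bisect import bisect_right


def bucket_counts(values, buckets):
    """Count values into right-open buckets; only the last bucket is closed.

    Hypothetical helper for illustration, not part of the PySpark API.
    """
    counts = [0] * (len(buckets) - 1)
    lo, hi = buckets[0], buckets[-1]
    for x in values:
        if x < lo or x > hi:
            continue  # assumption: values outside every bucket are ignored
        # bisect_right returns how many boundaries are <= x
        i = bisect_right(buckets, x) - 1
        if i == len(counts):
            i -= 1  # x == hi belongs to the last, closed bucket
        counts[i] += 1
    return counts


print(bucket_counts([1, 50], [1, 10, 20, 50]))  # [1, 0, 1]
```

The binary search is what gives the O(log n) cost per element for arbitrary sorted buckets.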
If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
this can be switched from an O(log n) insertion to O(1) per
element (where n = number of buckets).
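A minimal sketch of the O(1) lookup for evenly spaced buckets, assuming the index is found by dividing the offset from the lower bound by the bucket width. `even_bucket_index` is a hypothetical name, and the patch may compute this differently.

```python
def even_bucket_index(x, lo, hi, n):
    """Index of x among n equal-width buckets on [lo, hi], or None if outside.

    Hypothetical helper for illustration, not part of the PySpark API.
    """
    if x < lo or x > hi:
        return None
    width = (hi - lo) / n
    i = int((x - lo) / width)  # constant-time arithmetic, no search
    return min(i, n - 1)  # x == hi falls in the last, closed bucket


# Buckets [0, 10, 20, 30] correspond to lo=0, hi=30, n=3
print(even_bucket_index(10, 0, 30, 3))
```

Because this relies on floating-point division, values very close to a boundary may land in a neighbouring bucket. The search-based approach does not have that issue.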
Buckets must be sorted, must not contain any duplicates, and must
have at least two elements.
If `buckets` is a number, it will generate buckets that are
evenly spaced between the minimum and maximum of the RDD. For
example, if the min value is 0 and the max is 100, given buckets
as 2, the resulting buckets will be [0,50) [50,100]. buckets must
be at least 1. If the RDD contains infinity or NaN, an exception
is thrown. If the elements in the RDD do not vary (max == min),
it always returns a single bucket.
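How a bucket count could be turned into boundaries is sketched below. `make_buckets` is a hypothetical helper; raising `ValueError` is an assumption, since the description only says that an exception is thrown.

```python
import math


def make_buckets(lo, hi, n):
    """Build n evenly spaced buckets between lo and hi (illustrative only)."""
    if n < 1:
        raise ValueError("number of buckets must be at least 1")
    if any(math.isnan(v) or math.isinf(v) for v in (lo, hi)):
        # assumption: the exact exception type is not stated in the source
        raise ValueError("cannot generate buckets with infinity or NaN")
    if lo == hi:
        return [lo, hi]  # elements do not vary: a single bucket
    step = (hi - lo) / n
    boundaries = [lo + i * step for i in range(n)]
    boundaries.append(hi)  # use the exact maximum as the closing boundary
    return boundaries


print(make_buckets(0, 100, 2))
```

With min 0, max 100 and 2 buckets, this yields the boundaries 0, 50 and 100 (as floats where a division is involved), matching the [0,50) [50,100] example.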
It will return a tuple of buckets and histogram.
>>> rdd = sc.parallelize(range(51))
>>> rdd.histogram(2)
([0, 25, 50], [25, 26])
>>> rdd.histogram([0, 5, 25, 50])
([0, 5, 25, 50], [5, 20, 26])
>>> rdd.histogram([0, 15, 30, 45, 60])  # evenly spaced buckets
([0, 15, 30, 45, 60], [15, 15, 15, 6])
>>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
>>> rdd.histogram(("a", "b", "c"))
(('a', 'b', 'c'), [2, 2])
Closes #122, which is a duplicate.
Author: Davies Liu <davies.liu@gmail.com>
Closes #2091 from davies/histgram and squashes the following commits:
a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API
(cherry picked from commit 3cedc4f4d78e093fd362085e0a077bb9e4f28ca5)
Signed-off-by: Josh Rosen <joshrosen@apache.org>
Diffstat (limited to 'sql')
0 files changed, 0 insertions, 0 deletions