diff options
Diffstat (limited to 'python/pyspark/rdd.py')
-rw-r--r-- | python/pyspark/rdd.py | 18 |
1 files changed, 9 insertions, 9 deletions
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py index 6afe769662..0508235c1c 100644 --- a/python/pyspark/rdd.py +++ b/python/pyspark/rdd.py @@ -1027,20 +1027,20 @@ class RDD(object): If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per - element(where n = # buckets). + element (where n is the number of buckets). - Buckets must be sorted and not contain any duplicates, must be + Buckets must be sorted, not contain any duplicates, and have at least two elements. - If `buckets` is a number, it will generates buckets which are + If `buckets` is a number, it will generate buckets which are evenly spaced between the minimum and maximum of the RDD. For - example, if the min value is 0 and the max is 100, given buckets - as 2, the resulting buckets will be [0,50) [50,100]. buckets must - be at least 1 If the RDD contains infinity, NaN throws an exception - If the elements in RDD do not vary (max == min) always returns - a single bucket. + example, if the min value is 0 and the max is 100, given `buckets` + as 2, the resulting buckets will be [0,50) [50,100]. `buckets` must + be at least 1. An exception is raised if the RDD contains infinity. + If the elements in the RDD do not vary (max == min), a single bucket + will be used. - It will return a tuple of buckets and histogram. + The return value is a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) |