diff options
author | Sean Owen <sowen@cloudera.com> | 2016-12-07 17:34:45 +0800 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-12-07 17:34:45 +0800 |
commit | 79f5f281bb69cb2de9f64006180abd753e8ae427 (patch) | |
tree | a722ebc403cc4655e96e48b9e3de7502e04271a0 /core/src/main/scala/org | |
parent | b8280271396eb74638da6546d76bbb2d06c7011b (diff) | |
download | spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.gz spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.bz2 spark-79f5f281bb69cb2de9f64006180abd753e8ae427.zip |
[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils
## What changes were proposed in this pull request?
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.
## How was this patch tested?
Existing test plus new test case.
Author: Sean Owen <sowen@cloudera.com>
Closes #16129 from srowen/SPARK-18678.
Diffstat (limited to 'core/src/main/scala/org')
-rw-r--r-- | core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala | 5 |
1 files changed, 4 insertions, 1 deletions
diff --git a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala index 297524c943..a7e0075deb 100644 --- a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala +++ b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala @@ -56,11 +56,14 @@ private[spark] object SamplingUtils { val rand = new XORShiftRandom(seed) while (input.hasNext) { val item = input.next() + l += 1 + // There are k elements in the reservoir, and the l-th element has been + // consumed. It should be chosen with probability k/l. The expression + // below is a random long chosen uniformly from [0,l) val replacementIndex = (rand.nextDouble() * l).toLong if (replacementIndex < k) { reservoir(replacementIndex.toInt) = item } - l += 1 } (reservoir, l) } |