aboutsummaryrefslogtreecommitdiff
path: root/core/src/main/scala/org
diff options
context:
space:
mode:
authorSean Owen <sowen@cloudera.com>2016-12-07 17:34:45 +0800
committerSean Owen <sowen@cloudera.com>2016-12-07 17:34:45 +0800
commit79f5f281bb69cb2de9f64006180abd753e8ae427 (patch)
treea722ebc403cc4655e96e48b9e3de7502e04271a0 /core/src/main/scala/org
parentb8280271396eb74638da6546d76bbb2d06c7011b (diff)
downloadspark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.gz
spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.bz2
spark-79f5f281bb69cb2de9f64006180abd753e8ae427.zip
[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils
## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus new test case. Author: Sean Owen <sowen@cloudera.com> Closes #16129 from srowen/SPARK-18678.
Diffstat (limited to 'core/src/main/scala/org')
-rw-r--r--core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala5
1 files changed, 4 insertions, 1 deletions
diff --git a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
index 297524c943..a7e0075deb 100644
--- a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
+++ b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
@@ -56,11 +56,14 @@ private[spark] object SamplingUtils {
val rand = new XORShiftRandom(seed)
while (input.hasNext) {
val item = input.next()
+ l += 1
+ // There are k elements in the reservoir, and the l-th element has been
+ // consumed. It should be chosen with probability k/l. The expression
+ // below is a random long chosen uniformly from [0,l)
val replacementIndex = (rand.nextDouble() * l).toLong
if (replacementIndex < k) {
reservoir(replacementIndex.toInt) = item
}
- l += 1
}
(reservoir, l)
}