[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils

## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus new test case. Author: Sean Owen <sowen@cloudera.com> Closes #16129 from srowen/SPARK-18678.
author: Sean Owen <sowen@cloudera.com> 2016-12-07 17:34:45 +0800
committer: Sean Owen <sowen@cloudera.com> 2016-12-07 17:34:45 +0800
commit: 79f5f281bb69cb2de9f64006180abd753e8ae427 (patch)
tree: a722ebc403cc4655e96e48b9e3de7502e04271a0 /core/src/main/scala/org
parent: b8280271396eb74638da6546d76bbb2d06c7011b (diff)
download: spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.gz
spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.bz2
spark-79f5f281bb69cb2de9f64006180abd753e8ae427.zip
1 files changed, 4 insertions, 1 deletions
diff --git a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
index 297524c943..a7e0075deb 100644
--- a/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
+++ b/core/src/main/scala/org/apache/spark/util/random/SamplingUtils.scala
@@ -56,11 +56,14 @@ private[spark] object SamplingUtils {
       val rand = new XORShiftRandom(seed)
       while (input.hasNext) {
         val item = input.next()
+        l += 1
+        // There are k elements in the reservoir, and the l-th element has been
+        // consumed. It should be chosen with probability k/l. The expression
+        // below is a random long chosen uniformly from [0,l)
         val replacementIndex = (rand.nextDouble() * l).toLong
         if (replacementIndex < k) {
           reservoir(replacementIndex.toInt) = item
         }
-        l += 1
       }
       (reservoir, l)
     }
author	Sean Owen <sowen@cloudera.com>	2016-12-07 17:34:45 +0800
committer	Sean Owen <sowen@cloudera.com>	2016-12-07 17:34:45 +0800
commit	79f5f281bb69cb2de9f64006180abd753e8ae427 (patch)
tree	a722ebc403cc4655e96e48b9e3de7502e04271a0 /core/src/main/scala/org
parent	b8280271396eb74638da6546d76bbb2d06c7011b (diff)
download	spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.gz spark-79f5f281bb69cb2de9f64006180abd753e8ae427.tar.bz2 spark-79f5f281bb69cb2de9f64006180abd753e8ae427.zip