author Sean Owen <sowen@cloudera.com> 2016-03-28 12:01:33 +0100
committer Sean Owen <sowen@cloudera.com> 2016-03-28 12:01:33 +0100
commit 7b841540180e8d1403d6c95b02e93f129267b34f
tree 95b5105e64bc651b14bd6129201fee6ba111a40d
parent aac13fb48c8aa7d6816ea46c2e40154913477717
[SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode
## What changes were proposed in this pull request?

Emit a better error message when k-means initialization cannot take enough samples from the input (perhaps because the input is empty).

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11979 from srowen/SPARK-12494.
-rw-r--r-- mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala | 2
1 file changed, 2 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
index a7beb81980..37a21cd879 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
@@ -390,6 +390,8 @@ class KMeans private (
// Initialize each run's first center to a random point.
val seed = new XORShiftRandom(this.seed).nextInt()
val sample = data.takeSample(true, runs, seed).toSeq
+ // Could be empty if data is empty; fail with a better message early:
+ require(sample.size >= runs, s"Required $runs samples but got ${sample.size} from $data")
val newCenters = Array.tabulate(runs)(r => ArrayBuffer(sample(r).toDense))
/** Merges new centers to centers. */
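
The effect of the added `require` can be sketched without a Spark cluster. The snippet below is a minimal illustration, not the MLlib implementation: `KMeansInitSketch`, `takeSampleWithReplacement`, and `firstCenters` are hypothetical names, and the local sampler is only an assumed stand-in for `RDD.takeSample(true, runs, seed)`, which returns an empty result for empty input.

```scala
import scala.util.{Random, Try}

// Hypothetical, Spark-free sketch of the guard added in this patch.
object KMeansInitSketch {

  // Assumed stand-in for RDD.takeSample(withReplacement = true, num, seed):
  // yields `num` random elements, or nothing at all when the input is empty.
  def takeSampleWithReplacement[T](data: Seq[T], num: Int, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    if (data.isEmpty) Seq.empty
    else Seq.fill(num)(data(rng.nextInt(data.size)))
  }

  // Picks one starting center per run. The require fails early with a
  // descriptive message instead of an index-out-of-bounds error at sample(r).
  def firstCenters(data: Seq[Array[Double]], runs: Int, seed: Long): Seq[Array[Double]] = {
    val sample = takeSampleWithReplacement(data, runs, seed)
    require(sample.size >= runs, s"Required $runs samples but got ${sample.size}")
    Seq.tabulate(runs)(r => sample(r))
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(1.0, 2.0), Array(3.0, 4.0))
    // Non-empty input: one center per run is returned.
    println(firstCenters(points, 3, 42L).size)
    // Empty input: the guard raises IllegalArgumentException.
    println(Try(firstCenters(Seq.empty, 3, 42L)).isFailure)
  }
}
```

Without the guard, the empty-input case would fail inside `Seq.tabulate` when indexing into the empty sample, which matches the array-out-of-bounds symptom named in the commit title.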