[SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k - spark

diff options

author	wm624@hotmail.com <wm624@hotmail.com>	2017-01-31 21:16:37 -0800
committer	Felix Cheung <felixcheung@apache.org>	2017-01-31 21:16:37 -0800
commit	9ac05225e870e41dc86cd6d61c7f0d111d172810 (patch)
tree	0576e4c396b0fd34abb94c63bdcae2beb623d64a /docs/job-scheduling.md
parent	9063835803e54538c94d95bbddcb4810dd7a1c55 (diff)
download	spark-9ac05225e870e41dc86cd6d61c7f0d111d172810.tar.gz spark-9ac05225e870e41dc86cd6d61c7f0d111d172810.tar.bz2 spark-9ac05225e870e41dc86cd6d61c7f0d111d172810.zip

[SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k

## What changes were proposed in this pull request When Kmeans using initMode = "random" and some random seed, it is possible the actual cluster size doesn't equal to the configured `k`. In this case, summary(model) returns error due to the number of cols of coefficient matrix doesn't equal to k. Example: > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > cols <- as.data.frame(cbind(col1, col2, col3)) > df <- createDataFrame(cols) > > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5) > > summary(model2) Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) : length of 'dimnames' [2] not equal to array extent In addition: Warning message: In matrix(coefficients, ncol = k) : data length [9] is not a sub-multiple or multiple of the number of rows [2] Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix. ## How was this patch tested? Add unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16666 from wangmiao1981/kmeans.

Diffstat (limited to 'docs/job-scheduling.md')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: