author Yanbo Liang <ybliang8@gmail.com> 2016-08-12 10:06:17 -0700
committer Yanbo Liang <ybliang8@gmail.com> 2016-08-12 10:06:17 -0700
commit bbae20ade14e50541e4403ca7b45bf6c11695d15 (patch)
tree 41d0da76679d36b07252e040078be071a41aea23 /mllib
parent 79e2caa1328843457841d71642b60be919ebb1e0 (diff)
[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance

## What changes were proposed in this pull request?

`GaussianMixture` should use `treeAggregate` rather than `aggregate` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I measured a 20% performance improvement. This patch also destroys the broadcast variable `compute` at the end of each iteration.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14621 from yanboliang/spark-17033.

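
For readers unfamiliar with the two calls, here is a minimal sketch of the swap described above. It is not taken from the patch: the object name `TreeAggregateSketch`, the toy data, and the `(sum, count)` accumulator are all hypothetical, and it assumes a local `SparkContext`. Both methods take the same zero value, `seqOp`, and `combOp`; `treeAggregate` additionally accepts a `depth` (default 2) and merges partial results in stages on the executors rather than sending every partition's result straight to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TreeAggregateSketch {
  // Computes (sum, count) over a toy RDD with both methods so the results can be compared.
  // The data and the accumulator shape are illustrative only, not from GaussianMixture.
  def run(sc: SparkContext): ((Double, Long), (Double, Long)) = {
    val data = sc.parallelize(1 to 1000, numSlices = 16).map(_.toDouble)
    val zero = (0.0, 0L)

    // seqOp folds one element into a partition-local accumulator.
    val seqOp = (acc: (Double, Long), x: Double) => (acc._1 + x, acc._2 + 1L)
    // combOp merges two accumulators.
    val combOp = (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)

    // aggregate: every partition's accumulator is merged on the driver.
    val flat = data.aggregate(zero)(seqOp, combOp)
    // treeAggregate: accumulators are first combined in intermediate stages.
    val tree = data.treeAggregate(zero)(seqOp, combOp, depth = 2)
    (flat, tree)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tree-aggregate-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (flat, tree) = run(sc)
      println(s"aggregate: $flat, treeAggregate: $tree")
    } finally {
      sc.stop()
    }
  }
}
```

The two calls should return the same value whenever `combOp` is associative and commutative; any speedup depends on the number of partitions and the size of the accumulator, so the 20% figure above applies to the author's test, not to this toy input.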
Diffstat (limited to 'mllib')
-rw-r--r-- mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala 3
1 files changed, 2 insertions, 1 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index a214b1a26f..43193adf3e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -198,7 +198,7 @@ class GaussianMixture private (
val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
// aggregate the cluster contribution for all sample points
- val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+ val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
// Create new distributions based on the partial assignments
// (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
llhp = llh // current becomes previous
llh = sums.logLikelihood // this is the freshly computed log-likelihood
iter += 1
+ compute.destroy(blocking = false)
}
new GaussianMixtureModel(weights, gaussians)
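
The second hunk frees the per-iteration broadcast once the iteration's aggregation has finished. A minimal sketch of that lifecycle in user code follows. Everything in it is hypothetical (the object name `BroadcastPerIterationSketch`, the toy mean-estimation loop, the variable names), and it uses the no-argument `destroy()` on the assumption that the `blocking` overload called in the patch may not be accessible outside Spark's own packages in every release.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastPerIterationSketch {
  // Toy iterative job: re-estimates the mean of 1..100 by broadcasting the current
  // estimate, summing residuals with treeAggregate, and then freeing the broadcast.
  def run(sc: SparkContext, iterations: Int): Double = {
    val data = sc.parallelize(1 to 100, numSlices = 8).map(_.toDouble).cache()
    val n = data.count()
    var center = 0.0
    var iter = 0
    while (iter < iterations) {
      // A fresh broadcast is created on every pass, as in the loop patched above.
      val bcCenter = sc.broadcast(center)
      val residualSum = data.treeAggregate(0.0)(
        (acc, x) => acc + (x - bcCenter.value), // seqOp reads the broadcast on executors
        _ + _                                   // combOp merges partial sums
      )
      center += residualSum / n
      iter += 1
      // The action above has completed, so this pass's broadcast is no longer needed.
      // A destroyed broadcast cannot be used again.
      bcCenter.destroy()
    }
    data.unpersist()
    center
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"estimated mean: ${run(sc, iterations = 3)}")
    } finally {
      sc.stop()
    }
  }
}
```

Without the `destroy()` call, each pass would leave its broadcast behind until Spark's context cleaner reclaims it, which is the accumulation the patch avoids.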