author | Yanbo Liang <ybliang8@gmail.com> | 2016-08-12 10:06:17 -0700 |
---|---|---|
committer | Yanbo Liang <ybliang8@gmail.com> | 2016-08-12 10:06:17 -0700 |
commit | bbae20ade14e50541e4403ca7b45bf6c11695d15 (patch) | |
tree | 41d0da76679d36b07252e040078be071a41aea23 | |
parent | 79e2caa1328843457841d71642b60be919ebb1e0 (diff) | |
[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance
## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I observed a 20% performance improvement.
BTW, we should destroy the broadcast variable ```compute``` at the end of each iteration.
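For readers unfamiliar with the two calls, the following is a minimal sketch of the pattern, not the code from this patch. `Stats`, `scale`, and `TreeAggregateSketch` are hypothetical names standing in for `ExpectationSum` and `compute`. It assumes Spark is on the classpath and uses the public no-argument `destroy()`, since the `destroy(blocking = false)` overload used inside Spark may not be accessible from user code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for ExpectationSum: a running sum and count.
final case class Stats(sum: Double, count: Long) {
  def add(x: Double): Stats = Stats(sum + x, count + 1)
  def merge(other: Stats): Stats = Stats(sum + other.sum, count + other.count)
}

object TreeAggregateSketch {
  // Runs three iterations and returns (aggregate, treeAggregate) results for each.
  def run(): Seq[(Stats, Stats)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("tree-aggregate-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000, 16).map(_.toDouble).cache()
      (1 to 3).map { iter =>
        // A fresh broadcast variable per iteration, as in the EM loop.
        val scale = sc.broadcast(iter.toDouble)
        val seqOp = (acc: Stats, x: Double) => acc.add(x * scale.value)
        val combOp = (a: Stats, b: Stats) => a.merge(b)

        // aggregate: every partition's partial result is merged on the driver.
        val flat = data.aggregate(Stats(0.0, 0L))(seqOp, combOp)
        // treeAggregate: partial results are first merged on executors in a
        // multi-level tree, so fewer results reach the driver.
        val tree = data.treeAggregate(Stats(0.0, 0L))(seqOp, combOp)

        // Release the broadcast data once this iteration no longer needs it.
        scale.destroy()
        (flat, tree)
      }
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (flat, tree) => println(s"aggregate=$flat treeAggregate=$tree") }
}
```

Both calls should return the same value here; the difference is only in where the partial results are combined, which matters more as the number of partitions and the size of each partial result grow.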
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14621 from yanboliang/spark-17033.
-rw-r--r-- | mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala | 3 |
1 files changed, 2 insertions, 1 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index a214b1a26f..43193adf3e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -198,7 +198,7 @@ class GaussianMixture private (
       val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)

       // aggregate the cluster contribution for all sample points
-      val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+      val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

       // Create new distributions based on the partial assignments
       // (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
       llhp = llh // current becomes previous
       llh = sums.logLikelihood // this is the freshly computed log-likelihood
       iter += 1
+      compute.destroy(blocking = false)
     }

     new GaussianMixtureModel(weights, gaussians)