aboutsummaryrefslogtreecommitdiff
path: root/mllib
diff options
context:
space:
mode:
authorSean Owen <sowen@cloudera.com>2014-02-21 12:46:12 -0800
committerReynold Xin <rxin@apache.org>2014-02-21 12:46:12 -0800
commitc8a4c9b1f6005815f5a4a331970624d1706b6b13 (patch)
tree2bcd8b8e0843440c13196d417e83e7f4c6b452fb /mllib
parent45b15e27a84527abeaa8588b0eb1ade7e831e6ef (diff)
downloadspark-c8a4c9b1f6005815f5a4a331970624d1706b6b13.tar.gz
spark-c8a4c9b1f6005815f5a4a331970624d1706b6b13.tar.bz2
spark-c8a4c9b1f6005815f5a4a331970624d1706b6b13.zip
MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of features
There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's computed as the sum of matrices; an f x f matrix is created for each of n user/item rows in a partition. In `ALS.scala:214`: ``` factors.flatMapValues{ case factorArray => factorArray.map{ vector => val x = new DoubleMatrix(vector) x.mmul(x.transpose()) } }.reduceByKeyLocally((a, b) => a.addi(b)) .values .reduce((a, b) => a.addi(b)) ``` Completely correct, but there's a subtle but quite large memory problem here. map() is going to create all of these matrices in memory at once, when they don't need to ever all exist at the same time. For example, if a partition has n = 100000 rows, and f = 200, then this intermediate product requires 32GB of heap. The computation will never work unless you can cough up workers with (more than) that much heap. Fortunately there's a trivial change that fixes it; just add `.view` in there. Author: Sean Owen <sowen@cloudera.com> Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the following commits: 062cda9 [Sean Owen] Update style per review comments e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not simultaneously allocating lots of matrices
Diffstat (limited to 'mllib')
-rw-r--r--mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala4
1 files changed, 2 insertions, 2 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
index c668b0412c..8958040e36 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
@@ -211,8 +211,8 @@ class ALS private (var numBlocks: Int, var rank: Int, var iterations: Int, var l
def computeYtY(factors: RDD[(Int, Array[Array[Double]])]) = {
if (implicitPrefs) {
Option(
- factors.flatMapValues{ case factorArray =>
- factorArray.map{ vector =>
+ factors.flatMapValues { case factorArray =>
+ factorArray.view.map { vector =>
val x = new DoubleMatrix(vector)
x.mmul(x.transpose())
}