author     Aaron Staple <aaron.staple@gmail.com>   2014-09-25 16:11:00 -0700
committer  Xiangrui Meng <meng@databricks.com>     2014-09-25 16:11:00 -0700
commit     ff637c9380a6342fd0a4dde0710ec23856751dd4 (patch)
tree       f126756bb8adecd89014b5c18df60c4a6b44bf19 /docs
parent     9b56e249e09d8da20f703b9381c5c3c8a1a1d4a9 (diff)
[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm's current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally and so do not require a warning.

I added the warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because GeneralizedLinearAlgorithm internally maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed on every GeneralizedLinearAlgorithm run, regardless of whether the input is cached.) I assume that GeneralizedLinearAlgorithm's use of an uncached RDD is intentional, and that the mapping there (adding label, intercepts, and scaling) is a lightweight operation. Arguably, a user calling an optimizer such as GradientDescent directly will be knowledgeable enough to cache their data without needing a log warning, so the lack of a warning in the optimizers may be acceptable.

Some of the documentation examples using these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2347 from staple/SPARK-1484 and squashes the following commits:

bd49701 [Aaron Staple] Address review comments.
ab2d4a4 [Aaron Staple] Disable warnings on python code path.
a7a0f99 [Aaron Staple] Change code comments per review comments.
7cca1dc [Aaron Staple] Change warning message text.
c77e939 [Aaron Staple] [SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data.
3b6c511 [Aaron Staple] Minor doc example fixes.
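For context, the check this patch describes amounts to comparing the input RDD's storage level against StorageLevel.NONE before iterating. A minimal sketch of the pattern (warnIfUncached is a hypothetical helper name, and the message text is approximate; the actual Spark code lives in KMeans, GeneralizedLinearAlgorithm, and RowMatrix.computeSVD and logs via the internal Logging trait rather than println):

{% highlight scala %}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch of the uncached-input warning added by this patch.
def warnIfUncached[T](input: RDD[T], caller: String): Unit = {
  if (input.getStorageLevel == StorageLevel.NONE) {
    // An uncached RDD is recomputed from its parents on every iteration,
    // which can dominate the runtime of an iterative algorithm.
    println(s"WARN $caller: The input data is not cached, which may hurt performance.")
  }
}
{% endhighlight %}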
Diffstat (limited to 'docs')
-rw-r--r--  docs/mllib-clustering.md      3
-rw-r--r--  docs/mllib-linear-methods.md  9
-rw-r--r--  docs/mllib-optimization.md    1
3 files changed, 8 insertions, 5 deletions
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index dfd9cd5728..d10bd63746 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -52,7 +52,7 @@ import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
-val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
+val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
@@ -100,6 +100,7 @@ public class KMeansExample {
}
}
);
+ parsedData.cache();
// Cluster the data into two classes using KMeans
int numClusters = 2;
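Taken together, the updated Scala example in mllib-clustering.md reads roughly as below (a sketch assuming a spark-shell session where sc is defined; file path and parameter values as in the docs):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data, caching it so KMeans's iterations
// do not re-read and re-parse the text file on every pass.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
{% endhighlight %}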
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 9137f9dc1b..d31bec3e1b 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -396,7 +396,7 @@ val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
-}
+}.cache()
// Building the model
val numIterations = 100
@@ -455,6 +455,7 @@ public class LinearRegression {
}
}
);
+ parsedData.cache();
// Building the model
int numIterations = 100;
@@ -470,7 +471,7 @@ public class LinearRegression {
}
}
);
- JavaRDD<Object> MSE = new JavaDoubleRDD(valuesAndPreds.map(
+ double MSE = new JavaDoubleRDD(valuesAndPreds.map(
new Function<Tuple2<Double, Double>, Object>() {
public Object call(Tuple2<Double, Double> pair) {
return Math.pow(pair._1() - pair._2(), 2.0);
@@ -553,8 +554,8 @@ but in practice you will likely want to use unlabeled vectors for test data.
{% highlight scala %}
-val trainingData = ssc.textFileStream('/training/data/dir').map(LabeledPoint.parse)
-val testData = ssc.textFileStream('/testing/data/dir').map(LabeledPoint.parse)
+val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse).cache()
+val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
{% endhighlight %}
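For comparison, the Scala version of the linear regression example computes MSE as a plain Double as well, matching the Java-side fix above; roughly (a sketch assuming a spark-shell session where sc is defined):

{% highlight scala %}
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

// Parse and cache the training data, as in the updated example.
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model, then evaluate it with mean squared error.
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}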
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index 26ce5f3c50..45141c235b 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -217,6 +217,7 @@ import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionModel
+import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size
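The newly imported LBFGS, LogisticGradient, and SquaredL2Updater are used further down in that example; a compressed sketch of the call, with illustrative parameter values following the docs (again assuming a spark-shell session where sc is defined):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size

// Append a bias (intercept) term and cache, since L-BFGS iterates over the data.
val training = data.map(x => (x.label, MLUtils.appendBias(x.features))).cache()

val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))
val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,   // numCorrections
  1e-4, // convergenceTol
  20,   // maxNumIterations
  0.1,  // regParam
  initialWeightsWithIntercept)
{% endhighlight %}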