-rw-r--r--  docs/mllib-classification-regression.md     9
-rw-r--r--  docs/mllib-collaborative-filtering.md        9
-rw-r--r--  docs/mllib-decision-tree.md                 20
-rw-r--r--  docs/mllib-ensembles.md                     94
-rw-r--r--  docs/mllib-linear-methods.md                21
-rw-r--r--  docs/mllib-naive-bayes.md                   10
-rw-r--r--  python/pyspark/mllib/feature.py             67
-rw-r--r--  python/pyspark/mllib/regression.py           7
-rw-r--r--  python/pyspark/mllib/tree.py               156
9 files changed, 296 insertions, 97 deletions
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index 5b9b4dd83b..8e91d62f4a 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -17,13 +17,13 @@ the supported algorithms for each type of problem.
</thead>
<tbody>
<tr>
- <td>Binary Classification</td><td>linear SVMs, logistic regression, decision trees, naive Bayes</td>
+ <td>Binary Classification</td><td>linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes</td>
</tr>
<tr>
- <td>Multiclass Classification</td><td>decision trees, naive Bayes</td>
+ <td>Multiclass Classification</td><td>decision trees, random forests, naive Bayes</td>
</tr>
<tr>
- <td>Regression</td><td>linear least squares, Lasso, ridge regression, decision trees, isotonic regression</td>
+ <td>Regression</td><td>linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression</td>
</tr>
</tbody>
</table>
@@ -34,5 +34,8 @@ More details for these methods can be found here:
* [binary classification (SVMs, logistic regression)](mllib-linear-methods.html#binary-classification)
* [linear regression (least squares, Lasso, ridge)](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
* [Decision trees](mllib-decision-tree.html)
+* [Ensembles of decision trees](mllib-ensembles.html)
+ * [random forests](mllib-ensembles.html#random-forests)
+ * [gradient-boosted trees](mllib-ensembles.html#gradient-boosted-trees-gbts)
* [Naive Bayes](mllib-naive-bayes.html)
* [Isotonic regression](mllib-isotonic-regression.html)
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index ef18cec937..935cd8dad3 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -66,6 +66,7 @@ recommendation model by measuring the Mean Squared Error of rating prediction.
{% highlight scala %}
import org.apache.spark.mllib.recommendation.ALS
+import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
@@ -95,6 +96,9 @@ val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
err * err
}.mean()
println("Mean Squared Error = " + MSE)
+
+model.save("myModelPath")
+val sameModel = MatrixFactorizationModel.load("myModelPath")
{% endhighlight %}
If the rating matrix is derived from another source of information (e.g., it is inferred from
@@ -181,6 +185,9 @@ public class CollaborativeFiltering {
}
).rdd()).mean();
System.out.println("Mean Squared Error = " + MSE);
+
+ model.save("myModelPath");
+ MatrixFactorizationModel sameModel = MatrixFactorizationModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -191,6 +198,8 @@ In the following example we load rating data. Each row consists of a user, a pro
We use the default ALS.train() method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.recommendation import ALS, Rating
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 6675133a81..4695d1cde4 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -194,6 +194,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -221,6 +222,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = DecisionTreeModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -279,10 +283,16 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification tree model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+DecisionTreeModel sameModel = DecisionTreeModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
@@ -324,6 +334,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -350,6 +361,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = DecisionTreeModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -414,10 +428,16 @@ Double testMSE =
}) / data.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression tree model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+DecisionTreeModel sameModel = DecisionTreeModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 00040e6073..ddae84165f 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -98,6 +98,7 @@ The test error is calculated to measure the algorithm accuracy.
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -127,6 +128,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = RandomForestModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -188,10 +192,16 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification forest model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+RandomForestModel sameModel = RandomForestModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
@@ -235,6 +245,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -264,6 +275,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = RandomForestModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -328,10 +342,16 @@ Double testMSE =
}) / testData.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression forest model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+RandomForestModel sameModel = RandomForestModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
@@ -441,8 +461,6 @@ iterations.
### Examples
-GBTs currently have APIs in Scala and Java. Examples in both languages are shown below.
-
#### Classification
The example below demonstrates how to load a
@@ -457,6 +475,7 @@ The test error is calculated to measure the algorithm accuracy.
{% highlight scala %}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -484,6 +503,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification GBT model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = GradientBoostedTreesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -545,6 +567,38 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification GBT model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load("myModelPath");
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
+{% highlight python %}
+from pyspark.mllib.tree import GradientBoostedTrees
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file.
+data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a GradientBoostedTrees model.
+# Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
+# (b) Use more iterations in practice.
+model = GradientBoostedTrees.trainClassifier(trainingData,
+ categoricalFeaturesInfo={}, numIterations=3)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
+print('Test Error = ' + str(testErr))
+print('Learned classification GBT model:')
+print(model.toDebugString())
{% endhighlight %}
</div>
@@ -565,6 +619,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
{% highlight scala %}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -591,6 +646,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression GBT model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = GradientBoostedTreesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -658,6 +716,38 @@ Double testMSE =
}) / data.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression GBT model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load("myModelPath");
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
+{% highlight python %}
+from pyspark.mllib.tree import GradientBoostedTrees
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file.
+data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a GradientBoostedTrees model.
+# Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
+# (b) Use more iterations in practice.
+model = GradientBoostedTrees.trainRegressor(trainingData,
+ categoricalFeaturesInfo={}, numIterations=3)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
+print('Test Mean Squared Error = ' + str(testMSE))
+print('Learned regression GBT model:')
+print(model.toDebugString())
{% endhighlight %}
</div>
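
One caveat on the new Python GBT snippets above: the error computations use tuple-parameter unpacking in lambdas (`lambda (v, p): ...`), which is Python 2-only syntax (removed in Python 3 by PEP 3113). A minimal sketch of a version that works on both, reusing `labelsAndPredictions` and `testData` from the examples above:

{% highlight python %}
# Unpack inside the lambda body instead of in the parameter list.
testErr = labelsAndPredictions.filter(
    lambda vp: vp[0] != vp[1]).count() / float(testData.count())

testMSE = labelsAndPredictions.map(
    lambda vp: (vp[0] - vp[1]) ** 2).sum() / float(testData.count())
{% endhighlight %}
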
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 44b7f67c57..d9fc63b37d 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -190,7 +190,7 @@ error.
{% highlight scala %}
import org.apache.spark.SparkContext
-import org.apache.spark.mllib.classification.SVMWithSGD
+import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
@@ -222,6 +222,9 @@ val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
+
+model.save("myModelPath")
+val sameModel = SVMModel.load("myModelPath")
{% endhighlight %}
The `SVMWithSGD.train()` method by default performs L2 regularization with the
@@ -304,6 +307,9 @@ public class SVMClassifier {
double auROC = metrics.areaUnderROC();
System.out.println("Area under ROC = " + auROC);
+
+ model.save("myModelPath");
+ SVMModel sameModel = SVMModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -338,6 +344,8 @@ a dependency.
The following example shows how to load a sample dataset, build a Logistic Regression model,
and make predictions with the resulting model to compute the training error.
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
@@ -391,8 +399,9 @@ values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
{% highlight scala %}
-import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.regression.LinearRegressionModel
+import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
@@ -413,6 +422,9 @@ val valuesAndPreds = parsedData.map { point =>
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
+
+model.save("myModelPath")
+val sameModel = LinearRegressionModel.load("myModelPath")
{% endhighlight %}
[`RidgeRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
@@ -483,6 +495,9 @@ public class LinearRegression {
}
).rdd()).mean();
System.out.println("training Mean Squared Error = " + MSE);
+
+ model.save("myModelPath");
+ LinearRegressionModel sameModel = LinearRegressionModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -494,6 +509,8 @@ The example then uses LinearRegressionWithSGD to build a simple linear model to
values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index d5b044d94f..81173255b5 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -37,7 +37,7 @@ smoothing parameter `lambda` as input, and output a
can be used for evaluation and prediction.
{% highlight scala %}
-import org.apache.spark.mllib.classification.NaiveBayes
+import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
@@ -55,6 +55,9 @@ val model = NaiveBayes.train(training, lambda = 1.0)
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
+
+model.save("myModelPath")
+val sameModel = NaiveBayesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -93,6 +96,9 @@ double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>,
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
+
+model.save("myModelPath");
+NaiveBayesModel sameModel = NaiveBayesModel.load("myModelPath");
{% endhighlight %}
</div>
@@ -105,6 +111,8 @@ smoothing parameter `lambda` as input, and output a
[NaiveBayesModel](api/python/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
+Note that the Python API does not yet support model save/load but will in the future.
+
<!-- TODO: Make Python's example consistent with Scala's and Java's. -->
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
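
On the TODO above about making the Python example consistent with Scala's and Java's: a rough sketch of what that could look like, not part of this patch, using a tiny inline dataset and assuming an existing SparkContext `sc`:

{% highlight python %}
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Tiny inline dataset; the real example would parse a data file instead.
points = [
    LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
    LabeledPoint(0.0, Vectors.dense([2.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 2.0])),
    LabeledPoint(0.0, Vectors.dense([3.0, 0.0])),
    LabeledPoint(1.0, Vectors.dense([0.0, 3.0])),
]
training = sc.parallelize(points[:4])
test = sc.parallelize(points[4:])

# Train with additive smoothing parameter lambda = 1.0.
model = NaiveBayes.train(training, 1.0)

# Predict each test point and compute accuracy, mirroring the Scala example.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print("Accuracy = " + str(accuracy))
{% endhighlight %}
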
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index 10df628806..0ffe092a07 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -58,7 +58,8 @@ class Normalizer(VectorTransformer):
For any 1 <= `p` < float('inf'), normalizes samples using
sum(abs(vector) :sup:`p`) :sup:`(1/p)` as norm.
- For `p` = float('inf'), max(abs(vector)) will be used as norm for normalization.
+ For `p` = float('inf'), max(abs(vector)) will be used as norm for
+ normalization.
>>> v = Vectors.dense(range(3))
>>> nor = Normalizer(1)
@@ -120,9 +121,14 @@ class StandardScalerModel(JavaVectorTransformer):
"""
Applies standardization transformation on a vector.
+ Note: In Python, transform cannot currently be used within
+ an RDD transformation or action.
+ Call transform directly on the RDD instead.
+
:param vector: Vector or RDD of Vector to be standardized.
- :return: Standardized vector. If the variance of a column is zero,
- it will return default `0.0` for the column with zero variance.
+ :return: Standardized vector. If the variance of a column is
+ zero, it will return default `0.0` for the column with
+ zero variance.
"""
return JavaVectorTransformer.transform(self, vector)
@@ -148,9 +154,10 @@ class StandardScaler(object):
"""
:param withMean: False by default. Centers the data with mean
before scaling. It will build a dense output, so this
- does not work on sparse input and will raise an exception.
- :param withStd: True by default. Scales the data to unit standard
- deviation.
+ does not work on sparse input and will raise an
+ exception.
+ :param withStd: True by default. Scales the data to unit
+ standard deviation.
"""
if not (withMean or withStd):
warnings.warn("Both withMean and withStd are false. The model does nothing.")
@@ -159,10 +166,11 @@ class StandardScaler(object):
def fit(self, dataset):
"""
- Computes the mean and variance and stores as a model to be used for later scaling.
+ Computes the mean and variance and stores as a model to be used
+ for later scaling.
- :param data: The data used to compute the mean and variance to build
- the transformation model.
+ :param dataset: The data used to compute the mean and variance
+ to build the transformation model.
:return: a StandardScalarModel
"""
dataset = dataset.map(_convert_to_vector)
@@ -174,7 +182,8 @@ class HashingTF(object):
"""
.. note:: Experimental
- Maps a sequence of terms to their term frequencies using the hashing trick.
+ Maps a sequence of terms to their term frequencies using the hashing
+ trick.
Note: the terms must be hashable (can not be dict/set/list...).
@@ -195,8 +204,9 @@ class HashingTF(object):
def transform(self, document):
"""
- Transforms the input document (list of terms) to term frequency vectors,
- or transform the RDD of document to RDD of term frequency vectors.
+ Transforms the input document (list of terms) to term frequency
+ vectors, or transforms the RDD of documents to an RDD of term
+ frequency vectors.
"""
if isinstance(document, RDD):
return document.map(self.transform)
@@ -220,7 +230,12 @@ class IDFModel(JavaVectorTransformer):
the terms which occur in fewer than `minDocFreq`
documents will have an entry of 0.
- :param x: an RDD of term frequency vectors or a term frequency vector
+ Note: In Python, transform cannot currently be used within
+ an RDD transformation or action.
+ Call transform directly on the RDD instead.
+
+ :param x: an RDD of term frequency vectors or a term frequency
+ vector
:return: an RDD of TF-IDF vectors or a TF-IDF vector
"""
if isinstance(x, RDD):
@@ -241,9 +256,9 @@ class IDF(object):
of documents that contain term `t`.
This implementation supports filtering out terms which do not appear
- in a minimum number of documents (controlled by the variable `minDocFreq`).
- For terms that are not in at least `minDocFreq` documents, the IDF is
- found as 0, resulting in TF-IDFs of 0.
+ in a minimum number of documents (controlled by the variable
+ `minDocFreq`). For terms that are not in at least `minDocFreq`
+ documents, the IDF is found as 0, resulting in TF-IDFs of 0.
>>> n = 4
>>> freqs = [Vectors.sparse(n, (1, 3), (1.0, 2.0)),
@@ -325,15 +340,16 @@ class Word2Vec(object):
The vector representation can be used as features in
natural language processing and machine learning algorithms.
- We used skip-gram model in our implementation and hierarchical softmax
- method to train the model. The variable names in the implementation
- matches the original C implementation.
+ We use the skip-gram model in our implementation and the
+ hierarchical softmax method to train the model. The variable names
+ in the implementation match those in the original C implementation.
- For original C implementation, see https://code.google.com/p/word2vec/
+ For the original C implementation,
+ see https://code.google.com/p/word2vec/
For research papers, see
Efficient Estimation of Word Representations in Vector Space
- and
- Distributed Representations of Words and Phrases and their Compositionality.
+ and Distributed Representations of Words and Phrases and their
+ Compositionality.
>>> sentence = "a b " * 100 + "a c " * 10
>>> localDoc = [sentence, sentence]
@@ -374,15 +390,16 @@ class Word2Vec(object):
def setNumPartitions(self, numPartitions):
"""
- Sets number of partitions (default: 1). Use a small number for accuracy.
+ Sets number of partitions (default: 1). Use a small number for
+ accuracy.
"""
self.numPartitions = numPartitions
return self
def setNumIterations(self, numIterations):
"""
- Sets number of iterations (default: 1), which should be smaller than or equal to number of
- partitions.
+ Sets number of iterations (default: 1), which should be smaller
+ than or equal to the number of partitions.
"""
self.numIterations = numIterations
return self
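
To make the new "call transform directly on the RDD" notes concrete, a small sketch of the intended usage pattern for StandardScaler (illustrative data; assumes an existing SparkContext `sc`):

{% highlight python %}
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([1.0, 10.0]),
                       Vectors.dense([2.0, 20.0]),
                       Vectors.dense([3.0, 30.0])])

scaler = StandardScaler(withMean=False, withStd=True).fit(data)

# Supported: transform the whole RDD (or a single local vector) directly.
scaled = scaler.transform(data)

# Not supported (per the note above): calling the Java-backed transform from
# inside an RDD closure, e.g. data.map(lambda v: scaler.transform(v)).
{% endhighlight %}
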
diff --git a/python/pyspark/mllib/regression.py b/python/pyspark/mllib/regression.py
index 21751cc68f..66617abb85 100644
--- a/python/pyspark/mllib/regression.py
+++ b/python/pyspark/mllib/regression.py
@@ -31,8 +31,11 @@ class LabeledPoint(object):
The features and labels of a data point.
:param label: Label for this data point.
- :param features: Vector of features for this point (NumPy array, list,
- pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix)
+ :param features: Vector of features for this point (NumPy array,
+ list, pyspark.mllib.linalg.SparseVector, or scipy.sparse
+ column matrix)
+
+ Note: 'label' and 'features' are accessible as instance attributes.
"""
def __init__(self, label, features):
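
A two-line illustration of the attribute access described in the new note (hypothetical values):

{% highlight python %}
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

lp = LabeledPoint(1.0, Vectors.sparse(3, {0: 2.5, 2: 1.0}))
print(lp.label)     # 1.0
print(lp.features)  # SparseVector(3, {0: 2.5, 2: 1.0})
{% endhighlight %}
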
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index 02d551b87d..73618f0449 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -33,6 +33,10 @@ class TreeEnsembleModel(JavaModelWrapper):
"""
Predict values for a single data point or an RDD of points using
the model trained.
+
+ Note: In Python, predict cannot currently be used within an RDD
+ transformation or action.
+ Call predict directly on the RDD instead.
"""
if isinstance(x, RDD):
return self.call("predict", x.map(_convert_to_vector))
@@ -48,7 +52,8 @@ class TreeEnsembleModel(JavaModelWrapper):
def totalNumNodes(self):
"""
- Get total number of nodes, summed over all trees in the ensemble.
+ Get total number of nodes, summed over all trees in the
+ ensemble.
"""
return self.call("totalNumNodes")
@@ -71,6 +76,10 @@ class DecisionTreeModel(JavaModelWrapper):
"""
Predict the label of one or more examples.
+ Note: In Python, predict cannot currently be used within an RDD
+ transformation or action.
+ Call predict directly on the RDD instead.
+
:param x: Data point (feature vector),
or an RDD of data points (feature vectors).
"""
@@ -99,7 +108,8 @@ class DecisionTree(object):
"""
.. note:: Experimental
- Learning algorithm for a decision tree model for classification or regression.
+ Learning algorithm for a decision tree model for classification or
+ regression.
"""
@classmethod
@@ -176,17 +186,17 @@ class DecisionTree(object):
:param data: Training data: RDD of LabeledPoint.
Labels are real numbers.
- :param categoricalFeaturesInfo: Map from categorical feature index
- to number of categories.
- Any feature not in this map
- is treated as continuous.
+ :param categoricalFeaturesInfo: Map from categorical feature
+ index to number of categories.
+ Any feature not in this map is treated as continuous.
:param impurity: Supported values: "variance"
:param maxDepth: Max depth of tree.
- E.g., depth 0 means 1 leaf node.
- Depth 1 means 1 internal node + 2 leaf nodes.
- :param maxBins: Number of bins used for finding splits at each node.
- :param minInstancesPerNode: Min number of instances required at child
- nodes to create the parent split
+ E.g., depth 0 means 1 leaf node.
+ Depth 1 means 1 internal node + 2 leaf nodes.
+ :param maxBins: Number of bins used for finding splits at each
+ node.
+ :param minInstancesPerNode: Min number of instances required at
+ child nodes to create the parent split
:param minInfoGain: Min info gain required to create a split
:return: DecisionTreeModel
@@ -229,7 +239,8 @@ class RandomForest(object):
"""
.. note:: Experimental
- Learning algorithm for a random forest model for classification or regression.
+ Learning algorithm for a random forest model for classification or
+ regression.
"""
supportedFeatureSubsetStrategies = ("auto", "all", "sqrt", "log2", "onethird")
@@ -256,26 +267,33 @@ class RandomForest(object):
Method to train a decision tree model for binary or multiclass
classification.
- :param data: Training dataset: RDD of LabeledPoint. Labels should take
- values {0, 1, ..., numClasses-1}.
+ :param data: Training dataset: RDD of LabeledPoint. Labels
+ should take values {0, 1, ..., numClasses-1}.
:param numClasses: number of classes for classification.
- :param categoricalFeaturesInfo: Map storing arity of categorical features.
- E.g., an entry (n -> k) indicates that feature n is categorical
- with k categories indexed from 0: {0, 1, ..., k-1}.
+ :param categoricalFeaturesInfo: Map storing arity of categorical
+ features. E.g., an entry (n -> k) indicates that
+ feature n is categorical with k categories indexed
+ from 0: {0, 1, ..., k-1}.
:param numTrees: Number of trees in the random forest.
- :param featureSubsetStrategy: Number of features to consider for splits at
- each node.
- Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
- If "auto" is set, this parameter is set based on numTrees:
- if numTrees == 1, set to "all";
- if numTrees > 1 (forest) set to "sqrt".
- :param impurity: Criterion used for information gain calculation.
+ :param featureSubsetStrategy: Number of features to consider for
+ splits at each node.
+ Supported: "auto" (default), "all", "sqrt", "log2",
+ "onethird".
+ If "auto" is set, this parameter is set based on
+ numTrees:
+ if numTrees == 1, set to "all";
+ if numTrees > 1 (forest) set to "sqrt".
+ :param impurity: Criterion used for information gain
+ calculation.
Supported values: "gini" (recommended) or "entropy".
- :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node;
- depth 1 means 1 internal node + 2 leaf nodes. (default: 4)
- :param maxBins: maximum number of bins used for splitting features
+ :param maxDepth: Maximum depth of the tree.
+ E.g., depth 0 means 1 leaf node; depth 1 means
+ 1 internal node + 2 leaf nodes. (default: 4)
+ :param maxBins: maximum number of bins used for splitting
+ features
(default: 100)
- :param seed: Random seed for bootstrapping and choosing feature subsets.
+ :param seed: Random seed for bootstrapping and choosing feature
+ subsets.
:return: RandomForestModel that can be used for prediction
Example usage:
@@ -337,19 +355,24 @@ class RandomForest(object):
{0, 1, ..., k-1}.
:param numTrees: Number of trees in the random forest.
:param featureSubsetStrategy: Number of features to consider for
- splits at each node.
- Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
- If "auto" is set, this parameter is set based on numTrees:
- if numTrees == 1, set to "all";
- if numTrees > 1 (forest) set to "onethird" for regression.
- :param impurity: Criterion used for information gain calculation.
- Supported values: "variance".
- :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1
- leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- (default: 4)
- :param maxBins: maximum number of bins used for splitting features
- (default: 100)
- :param seed: Random seed for bootstrapping and choosing feature subsets.
+ splits at each node.
+ Supported: "auto" (default), "all", "sqrt", "log2",
+ "onethird".
+ If "auto" is set, this parameter is set based on
+ numTrees:
+ if numTrees == 1, set to "all";
+ if numTrees > 1 (forest) set to "onethird" for
+ regression.
+ :param impurity: Criterion used for information gain
+ calculation.
+ Supported values: "variance".
+ :param maxDepth: Maximum depth of the tree. E.g., depth 0 means
+ 1 leaf node; depth 1 means 1 internal node + 2 leaf
+ nodes. (default: 4)
+ :param maxBins: maximum number of bins used for splitting
+ features (default: 100)
+ :param seed: Random seed for bootstrapping and choosing feature
+ subsets.
:return: RandomForestModel that can be used for prediction
Example usage:
@@ -395,7 +418,8 @@ class GradientBoostedTrees(object):
"""
.. note:: Experimental
- Learning algorithm for a gradient boosted trees model for classification or regression.
+ Learning algorithm for a gradient boosted trees model for
+ classification or regression.
"""
@classmethod
@@ -411,24 +435,29 @@ class GradientBoostedTrees(object):
def trainClassifier(cls, data, categoricalFeaturesInfo,
loss="logLoss", numIterations=100, learningRate=0.1, maxDepth=3):
"""
- Method to train a gradient-boosted trees model for classification.
+ Method to train a gradient-boosted trees model for
+ classification.
- :param data: Training dataset: RDD of LabeledPoint. Labels should take values {0, 1}.
+ :param data: Training dataset: RDD of LabeledPoint.
+ Labels should take values {0, 1}.
:param categoricalFeaturesInfo: Map storing arity of categorical
features. E.g., an entry (n -> k) indicates that feature
n is categorical with k categories indexed from 0:
{0, 1, ..., k-1}.
- :param loss: Loss function used for minimization during gradient boosting.
- Supported: {"logLoss" (default), "leastSquaresError", "leastAbsoluteError"}.
+ :param loss: Loss function used for minimization during gradient
+ boosting. Supported: {"logLoss" (default),
+ "leastSquaresError", "leastAbsoluteError"}.
:param numIterations: Number of iterations of boosting.
(default: 100)
- :param learningRate: Learning rate for shrinking the contribution of each estimator.
- The learning rate should be between in the interval (0, 1]
- (default: 0.1)
- :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1
- leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- (default: 3)
- :return: GradientBoostedTreesModel that can be used for prediction
+ :param learningRate: Learning rate for shrinking the
+ contribution of each estimator. The learning rate
+ should be in the interval (0, 1].
+ (default: 0.1)
+ :param maxDepth: Maximum depth of the tree. E.g., depth 0 means
+ 1 leaf node; depth 1 means 1 internal node + 2 leaf
+ nodes. (default: 3)
+ :return: GradientBoostedTreesModel that can be used for
+ prediction
Example usage:
@@ -472,17 +501,20 @@ class GradientBoostedTrees(object):
features. E.g., an entry (n -> k) indicates that feature
n is categorical with k categories indexed from 0:
{0, 1, ..., k-1}.
- :param loss: Loss function used for minimization during gradient boosting.
- Supported: {"logLoss" (default), "leastSquaresError", "leastAbsoluteError"}.
+ :param loss: Loss function used for minimization during gradient
+ boosting. Supported: {"logLoss" (default),
+ "leastSquaresError", "leastAbsoluteError"}.
:param numIterations: Number of iterations of boosting.
(default: 100)
- :param learningRate: Learning rate for shrinking the contribution of each estimator.
- The learning rate should be between in the interval (0, 1]
- (default: 0.1)
- :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1
- leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- (default: 3)
- :return: GradientBoostedTreesModel that can be used for prediction
+ :param learningRate: Learning rate for shrinking the
+ contribution of each estimator. The learning rate
+ should be in the interval (0, 1].
+ (default: 0.1)
+ :param maxDepth: Maximum depth of the tree. E.g., depth 0 means
+ 1 leaf node; depth 1 means 1 internal node + 2 leaf
+ nodes. (default: 3)
+ :return: GradientBoostedTreesModel that can be used for
+ prediction
Example usage:
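
Finally, to tie the reworded parameter docs together: a usage sketch for RandomForest.trainClassifier with the parameters documented above spelled out explicitly (illustrative values only, not part of this patch; assumes an existing SparkContext `sc`):

{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Feature 0 is categorical with 2 categories; feature 1 is continuous.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.2]),
    LabeledPoint(0.0, [0.0, 0.8]),
    LabeledPoint(1.0, [1.0, 3.4]),
    LabeledPoint(1.0, [1.0, 2.9]),
])

model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={0: 2},
    numTrees=3, featureSubsetStrategy="auto", impurity="gini",
    maxDepth=4, maxBins=100, seed=42)

print(model.totalNumNodes())
print(model.predict([1.0, 3.0]))
{% endhighlight %}
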