path: root/docs/mllib-clustering.md
author: Emre Sevinç <emre.sevinc@gmail.com>  2015-02-13 12:31:27 -0800
committer: Xiangrui Meng <meng@databricks.com>  2015-02-13 12:31:27 -0800
commit: 9f31db061019414a964aac432e946eac61f8307c (patch)
tree: 631cf0b821109acb404455033af6833b5885e0a9 /docs/mllib-clustering.md
parent: 077eec2d9dba197f51004ee4a322d0fa71424ea0 (diff)
download: spark-9f31db061019414a964aac432e946eac61f8307c.tar.gz
          spark-9f31db061019414a964aac432e946eac61f8307c.tar.bz2
          spark-9f31db061019414a964aac432e946eac61f8307c.zip
SPARK-5805 Fixed the type error in documentation.
Fixes SPARK-5805 : Fix the type error in the final example given in MLlib - Clustering documentation.

Author: Emre Sevinç <emre.sevinc@gmail.com>

Closes #4596 from emres/SPARK-5805 and squashes the following commits:

1029f66 [Emre Sevinç] SPARK-5805 Fixed the type error in documentation.
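The type error being fixed: `StreamingKMeans.predictOnValues` takes a DStream of (key, vector) pairs, while the streaming k-means example passed the raw `DStream[LabeledPoint]` test stream directly. A minimal sketch of the corrected call, with `model` and `testData` assumed to be set up as in the guide's example:

{% highlight scala %}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Sketch only: `model` and `testData` are assumed to be built as in the guide's example.
def printPredictions(model: StreamingKMeans, testData: DStream[LabeledPoint]): Unit = {
  // model.predictOnValues(testData).print()   // old call: DStream[LabeledPoint] does not type-check
  // Corrected: map each LabeledPoint to a (label, features) key-value pair first.
  model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
}
{% endhighlight %}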
Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r--  docs/mllib-clustering.md  62
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 99ed6b60e3..0450e6fd3e 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -14,7 +14,7 @@ Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
-models are trained for each cluster).
+models are trained for each cluster).
MLlib supports the following models:
@@ -25,7 +25,7 @@ most commonly used clustering algorithms that clusters the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
-The implementation in MLlib has the following parameters:
+The implementation in MLlib has the following parameters:
* *k* is the number of desired clusters.
* *maxIterations* is the maximum number of iterations to run.
@@ -35,12 +35,12 @@ initialization via k-means\|\|.
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
-* *epsilon* determines the distance threshold within which we consider k-means to have converged.
+* *epsilon* determines the distance threshold within which we consider k-means to have converged.
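As an illustrative sketch only (not part of this patch), the parameters listed above correspond to setters on the `KMeans` builder; `parsedData` is assumed to be an RDD[Vector] prepared as in the guide's earlier k-means example:

{% highlight scala %}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: `parsedData` is an RDD[Vector] prepared as in the guide's k-means example.
def clusterWithKMeans(parsedData: RDD[Vector]): KMeansModel = {
  new KMeans()
    .setK(2)                            // k: number of desired clusters
    .setMaxIterations(20)               // maxIterations
    .setRuns(5)                         // runs: keep the best of several parallel runs
    .setInitializationMode("k-means||") // or "random"
    .setInitializationSteps(5)          // steps of the k-means|| initialization
    .setEpsilon(1e-4)                   // convergence distance threshold
    .run(parsedData)
}
{% endhighlight %}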
### Gaussian mixture
A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
-represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
+represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The MLlib implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples. The implementation
@@ -221,8 +221,8 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
<div class="codetabs">
<div data-lang="scala" markdown="1">
In the following example after loading and parsing data, we use a
-[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
+object to cluster the data into two clusters. The number of desired clusters is passed
to the algorithm. We then output the parameters of the mixture model.
{% highlight scala %}
@@ -238,7 +238,7 @@ val gmm = new GaussianMixture().setK(2).run(parsedData)
// output parameters of max-likelihood model
for (i <- 0 until gmm.k) {
- println("weight=%f\nmu=%s\nsigma=\n%s\n" format
+ println("weight=%f\nmu=%s\nsigma=\n%s\n" format
(gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
@@ -298,7 +298,7 @@ public class GaussianMixtureExample {
<div data-lang="python" markdown="1">
In the following example after loading and parsing data, we use a
[GaussianMixture](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+object to cluster the data into two clusters. The number of desired clusters is passed
to the algorithm. We then output the parameters of the mixture model.
{% highlight python %}
@@ -326,7 +326,7 @@ for i in range(2):
In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
-to infer three topics from the documents. The number of desired clusters is passed
+to infer three topics from the documents. The number of desired clusters is passed
to the algorithm. We then output the topics, represented as probability distributions over words.
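A minimal sketch of such a call (not part of this patch; `corpus` is an assumed RDD of (document id, word-count vector) pairs):

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

// Sketch only: `corpus` holds (document id, word-count vector) pairs.
def threeTopics(corpus: RDD[(Long, Vector)]): Matrix = {
  val ldaModel = new LDA().setK(3).run(corpus)
  // topicsMatrix is vocabSize x k: each column is one topic's distribution over words.
  ldaModel.topicsMatrix
}
{% endhighlight %}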
<div class="codetabs">
@@ -428,27 +428,27 @@ a dependency.
## Streaming clustering
-When data arrive in a stream, we may want to estimate clusters dynamically,
-updating them as new data arrive. MLlib provides support for streaming k-means clustering,
-with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
-uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
+When data arrive in a stream, we may want to estimate clusters dynamically,
+updating them as new data arrive. MLlib provides support for streaming k-means clustering,
+with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
+uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
all points to their nearest cluster, compute new cluster centers, then update each cluster using:
`\begin{equation}
c_{t+1} = \frac{c_tn_t\alpha + x_tm_t}{n_t\alpha+m_t}
\end{equation}`
`\begin{equation}
- n_{t+1} = n_t + m_t
+ n_{t+1} = n_t + m_t
\end{equation}`
-Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
-to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
-is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
-can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
-with `$\alpha$=0` only the most recent data will be used. This is analogous to an
-exponentially-weighted moving average.
+Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
+to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
+is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
+can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
+with `$\alpha$=0` only the most recent data will be used. This is analogous to an
+exponentially-weighted moving average.
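A toy sketch of the update rule above (not from the guide), using one-dimensional centers for readability:

{% highlight scala %}
// Toy sketch of the streaming k-means update rule with scalar centers.
// ct, nt: previous center and point count; xt, mt: batch center and batch point count.
def updateCluster(ct: Double, nt: Double, xt: Double, mt: Double, alpha: Double): (Double, Double) = {
  val cNext = (ct * nt * alpha + xt * mt) / (nt * alpha + mt) // c_{t+1}
  val nNext = nt + mt                                         // n_{t+1}
  (cNext, nNext)
}

// With alpha = 1 all history is kept; with alpha = 0 only the current batch matters:
// updateCluster(2.0, 100, 5.0, 10, 0.0) == (5.0, 110.0)
{% endhighlight %}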
-The decay can be specified using a `halfLife` parameter, which determines the
+The decay can be specified using a `halfLife` parameter, which determines the
correct decay factor `a` such that, for data acquired
at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
The unit of time can be specified either as `batches` or `points` and the update rule
@@ -472,9 +472,9 @@ import org.apache.spark.mllib.clustering.StreamingKMeans
{% endhighlight %}
-Then we make an input stream of vectors for training, as well as a stream of labeled data
-points for testing. We assume a StreamingContext `ssc` has been created, see
-[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
+Then we make an input stream of vectors for training, as well as a stream of labeled data
+points for testing. We assume a StreamingContext `ssc` has been created, see
+[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
{% highlight scala %}
@@ -496,24 +496,24 @@ val model = new StreamingKMeans()
{% endhighlight %}
-Now register the streams for training and testing and start the job, printing
+Now register the streams for training and testing and start the job, printing
the predicted cluster assignments on new data points as they arrive.
{% highlight scala %}
model.trainOn(trainingData)
-model.predictOnValues(testData).print()
+model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
-
+
{% endhighlight %}
-As you add new text files with data the cluster centers will update. Each training
+As you add new text files with data the cluster centers will update. Each training
point should be formatted as `[x1, x2, x3]`, and each test data point
-should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
-(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
-the model will update. Anytime a text file is placed in `/testing/data/dir`
+should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
+(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
+the model will update. Anytime a text file is placed in `/testing/data/dir`
you will see predictions. With new data, the cluster centers will change!
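For concreteness (illustration only; the numeric values are made up), lines in the formats described above can be parsed with MLlib's vector and labeled-point parsers, which is what the guide's input streams are assumed to rely on:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustration only: the literal strings mimic one line of a training file and one line
// of a test file in the formats described above (values are made up).
val trainingPoint = Vectors.parse("[1.0,0.5,2.3]")        // a line placed in /training/data/dir
val testPoint = LabeledPoint.parse("(1.0,[1.0,0.5,2.3])") // a line placed in /testing/data/dir
{% endhighlight %}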
</div>