author     Xiangrui Meng <meng@databricks.com>    2015-02-08 23:40:36 -0800
committer  Xiangrui Meng <meng@databricks.com>    2015-02-08 23:40:36 -0800
commit     855d12ac0a9cdade4cd2cc64c4e7209478be6690 (patch)
tree       0316f00abcadb9f1b357f93c831f123288066126 /docs
parent     4575c5643a82818bf64f9648314bdc2fdc12febb (diff)
download   spark-855d12ac0a9cdade4cd2cc64c4e7209478be6690.tar.gz
           spark-855d12ac0a9cdade4cd2cc64c4e7209478be6690.tar.bz2
           spark-855d12ac0a9cdade4cd2cc64c4e7209478be6690.zip
[SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code examples.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
Diffstat (limited to 'docs')
-rw-r--r--  docs/mllib-clustering.md  129
1 file changed, 128 insertions(+), 1 deletion(-)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 1e9ef345b7..99ed6b60e3 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -55,7 +55,7 @@ has the following parameters:
Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
-* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
+* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
* calculates the principal eigenvalue and eigenvector.
* clusters each of the input points according to their principal eigenvector component value.
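+
+A minimal usage sketch, assuming the `org.apache.spark.mllib.clustering.PowerIterationClustering` builder with `setK`/`setMaxIterations` and a `run` method that accepts an RDD of `(srcId, dstId, similarity)` tuples:
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+
+// Toy affinity data: (srcId, dstId, similarity) entries describing pairwise affinities.
+val similarities = sc.parallelize(Seq(
+  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)))
+
+// Cluster into two groups with a bounded number of power iterations.
+val pic = new PowerIterationClustering().setK(2).setMaxIterations(20)
+val model = pic.run(similarities)
+
+// Print each input point's cluster assignment.
+model.assignments.collect().foreach(println)
+{% endhighlight %}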
@@ -71,6 +71,35 @@ Example outputs for a dataset inspired by the paper - but with five clusters ins
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
+### Latent Dirichlet Allocation (LDA)
+
+[Latent Dirichlet Allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
+is a topic model which infers topics from a collection of text documents.
+LDA can be thought of as a clustering algorithm as follows:
+
+* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
+* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
+* Rather than estimating a clustering using a traditional distance, LDA uses a function based
+ on a statistical model of how text documents are generated.
+
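+Concretely, the "statistical model of how text documents are generated" referenced above is the standard LDA generative process (shown here as background; the symbols are the usual ones from the LDA literature, not MLlib API names):
+
+`\[
+\phi_k \sim \mathrm{Dirichlet}(\beta), \quad
+\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
+z_{d,i} \sim \mathrm{Multinomial}(\theta_d), \quad
+w_{d,i} \sim \mathrm{Multinomial}(\phi_{z_{d,i}})
+\]`
+
+Here `\(\phi_k\)` is topic `\(k\)`'s distribution over terms, `\(\theta_d\)` is document `\(d\)`'s distribution over topics, and each word `\(w_{d,i}\)` is drawn from the topic `\(z_{d,i}\)` chosen for that position. The `docConcentration` and `topicConcentration` parameters described below correspond to the Dirichlet priors `\(\alpha\)` and `\(\beta\)`.
+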
+LDA takes in a collection of documents as vectors of word counts.
+It learns a clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function. After fitting on the documents, LDA provides:
+
+* Topics: Inferred topics, each of which is a probability distribution over terms (words).
+* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
+
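+The word count vectors above are ordinary MLlib `Vector`s whose indices range over a fixed vocabulary. A minimal sketch of building such vectors from tokenized documents (the vocabulary handling here is only illustrative, not part of the MLlib API):
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+
+// Toy corpus: each document is a sequence of tokens (assumed already tokenized).
+val docs = sc.parallelize(Seq(
+  Seq("spark", "mllib", "lda"),
+  Seq("spark", "graphx", "spark")))
+
+// Build a vocabulary mapping term -> index (broadcast it for large vocabularies).
+val vocab = docs.flatMap(identity).distinct().collect().zipWithIndex.toMap
+
+// Turn each document into a sparse vector of term counts over the vocabulary.
+val wordCountVectors = docs.map { tokens =>
+  val counts = tokens.groupBy(identity).mapValues(_.size.toDouble)
+  Vectors.sparse(vocab.size, counts.map { case (term, n) => (vocab(term), n) }.toSeq)
+}
+// Pair these vectors with document IDs (as in the examples below) before running LDA.
+{% endhighlight %}
+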
+LDA takes the following parameters:
+
+* `k`: Number of topics (i.e., cluster centers)
+* `maxIterations`: Limit on the number of iterations of EM used for learning
+* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
+* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
+* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
+
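+A short configuration sketch, assuming the corresponding `set*` methods on `org.apache.spark.mllib.clustering.LDA` (checkpointing additionally requires a checkpoint directory to be set on the `SparkContext`):
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.LDA
+
+// Checkpointing only takes effect if a checkpoint directory is configured.
+sc.setCheckpointDir("/tmp/spark-checkpoints")
+
+val lda = new LDA()
+  .setK(10)                   // number of topics
+  .setMaxIterations(100)      // cap on the number of EM iterations
+  .setDocConcentration(1.1)   // prior on documents' topic distributions (> 1)
+  .setTopicConcentration(1.1) // prior on topics' term distributions (> 1)
+  .setCheckpointInterval(10)  // checkpoint every 10 iterations
+{% endhighlight %}
+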
+*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
+support prediction on new documents, and it does not have a Python API. These will be added in the future.
+
### Examples
#### k-means
@@ -293,6 +322,104 @@ for i in range(2):
</div>
+#### Latent Dirichlet Allocation (LDA) Example
+
+In the following example, we load word count vectors representing a corpus of documents.
+We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
+to infer three topics from the documents. The number of desired clusters is passed
+to the algorithm. We then output the topics, represented as probability distributions over words.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.LDA
+import org.apache.spark.mllib.linalg.Vectors
+
+// Load and parse the data
+val data = sc.textFile("data/mllib/sample_lda_data.txt")
+val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
+// Index documents with unique IDs
+val corpus = parsedData.zipWithIndex.map(_.swap).cache()
+
+// Cluster the documents into three topics using LDA
+val ldaModel = new LDA().setK(3).run(corpus)
+
+// Output topics. Each is a distribution over words (matching word count vectors)
+println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
+val topics = ldaModel.topicsMatrix
+for (topic <- Range(0, 3)) {
+ print("Topic " + topic + ":")
+ for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)); }
+ println()
+}
+{% endhighlight %}
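+
+Since `run` returns a `DistributedLDAModel` here, the per-document topic distributions described earlier can also be inspected. A short follow-up sketch, assuming the model exposes a `topicDistributions` method in this version:
+
+{% highlight scala %}
+// Per-document topic mixtures: RDD of (document ID, topic distribution vector).
+val docTopics = ldaModel.topicDistributions
+docTopics.take(5).foreach { case (docId, dist) =>
+  println(s"Document $docId: $dist")
+}
+{% endhighlight %}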
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import scala.Tuple2;
+
+import org.apache.spark.api.java.*;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.DistributedLDAModel;
+import org.apache.spark.mllib.clustering.LDA;
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.SparkConf;
+
+public class JavaLDAExample {
+ public static void main(String[] args) {
+ SparkConf conf = new SparkConf().setAppName("LDA Example");
+ JavaSparkContext sc = new JavaSparkContext(conf);
+
+ // Load and parse the data
+ String path = "data/mllib/sample_lda_data.txt";
+ JavaRDD<String> data = sc.textFile(path);
+ JavaRDD<Vector> parsedData = data.map(
+ new Function<String, Vector>() {
+ public Vector call(String s) {
+ String[] sarray = s.trim().split(" ");
+ double[] values = new double[sarray.length];
+ for (int i = 0; i < sarray.length; i++)
+ values[i] = Double.parseDouble(sarray[i]);
+ return Vectors.dense(values);
+ }
+ }
+ );
+ // Index documents with unique IDs
+ JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
+ new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
+ public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
+ return doc_id.swap();
+ }
+ }
+ ));
+ corpus.cache();
+
+ // Cluster the documents into three topics using LDA
+ DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);
+
+ // Output topics. Each is a distribution over words (matching word count vectors)
+ System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ + " words):");
+ Matrix topics = ldaModel.topicsMatrix();
+ for (int topic = 0; topic < 3; topic++) {
+ System.out.print("Topic " + topic + ":");
+ for (int word = 0; word < ldaModel.vocabSize(); word++) {
+ System.out.print(" " + topics.apply(word, topic));
+ }
+ System.out.println();
+ }
+ }
+}
+{% endhighlight %}
+</div>
+
+</div>
+
+
In order to run the above application, follow the instructions
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
section of the Spark