author    Yu ISHIKAWA <yuu.ishikawa@gmail.com>  2015-07-14 23:27:42 -0700
committer Joseph K. Bradley <joseph@databricks.com>  2015-07-14 23:27:42 -0700
commit    4692769655e09d129a62a89a8ffb5d635675aa4d (patch)
tree      b89ab2920c77ba44ad9897cbe6b524195b899820 /mllib
parent    c6b1a9e74e34267dc198e57a184c41498ca9d6a3 (diff)
[SPARK-6259] [MLLIB] Python API for LDA

I implemented the Python API for LDA, but I did not implement a method for `LDAModel.describeTopics()`, because it is a little hard to implement now: the return value of `describeTopics` in Scala consists of Tuple classes. Documentation and an example for it would fit better in another issue.

TODO: `LDAModel.describeTopics()` must also be implemented in Python, but that would be better handled as a separate issue.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:

6855f59 [Yu ISHIKAWA] LDA inherits object
28bd165 [Yu ISHIKAWA] Change the place of testing code
d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
9f8bed8 [Yu ISHIKAWA] Simplify casting
faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
73412c3 [Yu ISHIKAWA] Fix the typo
2278829 [Yu ISHIKAWA] Fix the indentation
39514ec [Yu ISHIKAWA] Modify how to cast the input data
8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
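The Scala stub added by this commit (`trainLDAModel`, shown in the diff below) expects each document as an `[id, vector]` pair and pattern-matches on the id, accepting only `java.lang.Integer` or `java.lang.Long`. The plain-Python sketch below mirrors that input contract so it can be checked without a Spark installation; `to_lda_document` is a hypothetical helper name for illustration, not part of the PySpark API.

```python
# Plain-Python sketch of the input contract enforced by the Scala stub
# `trainLDAModel`: each document is an [id, word_count_vector] pair, and
# the id must be an integral type. Mirrors the Scala match on
# java.lang.Integer / java.lang.Long. `to_lda_document` is a hypothetical
# name, not a PySpark function.

def to_lda_document(row):
    """Validate one [id, vector] row and normalize the id to int."""
    doc_id, vector = row
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(doc_id, bool) or not isinstance(doc_id, int):
        raise ValueError("input values contains invalid type value.")
    return (int(doc_id), vector)

# Example corpus: word-count vectors over a 3-term vocabulary
corpus = [
    [0, [1.0, 2.0, 6.0]],
    [1, [1.0, 3.0, 0.0]],
    [2, [1.0, 4.0, 1.0]],
]
documents = [to_lda_document(row) for row in corpus]
```

In PySpark itself this validation happens on the JVM side, after the Python RDD of `[id, vector]` lists is handed to `trainLDAModel`; a non-integral id fails there with the `IllegalArgumentException` seen in the diff.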
Diffstat (limited to 'mllib')
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 33
1 file changed, 33 insertions(+), 0 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index e628059c4a..c58a64001d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -503,6 +503,39 @@ private[python] class PythonMLLibAPI extends Serializable {
}
/**
+ * Java stub for Python mllib LDA.run()
+ */
+ def trainLDAModel(
+ data: JavaRDD[java.util.List[Any]],
+ k: Int,
+ maxIterations: Int,
+ docConcentration: Double,
+ topicConcentration: Double,
+ seed: java.lang.Long,
+ checkpointInterval: Int,
+ optimizer: String): LDAModel = {
+ val algo = new LDA()
+ .setK(k)
+ .setMaxIterations(maxIterations)
+ .setDocConcentration(docConcentration)
+ .setTopicConcentration(topicConcentration)
+ .setCheckpointInterval(checkpointInterval)
+ .setOptimizer(optimizer)
+
+ if (seed != null) algo.setSeed(seed)
+
+ val documents = data.rdd.map(_.asScala.toArray).map { r =>
+ r(0) match {
+ case i: java.lang.Integer => (i.toLong, r(1).asInstanceOf[Vector])
+ case i: java.lang.Long => (i.toLong, r(1).asInstanceOf[Vector])
+ case _ => throw new IllegalArgumentException("input values contains invalid type value.")
+ }
+ }
+ algo.run(documents)
+ }
+
+
+ /**
* Java stub for Python mllib FPGrowth.train(). This stub returns a handle
* to the Java object instead of the content of the Java object. Extra care
* needs to be taken in the Python code to ensure it gets freed on exit; see