[SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly

In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java, so Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Integer)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.

Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.

I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:

865b5ca [Xiangrui Meng] make Assignment serializable
cffa96e [Xiangrui Meng] fix test
9c0e590 [Xiangrui Meng] remove unused Tuple2
1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
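The wrapper-class idea described above can be illustrated without Spark. The following is a minimal standalone sketch, not the code added in this PR: the class `FreqItemsetSketch`, its nested `FreqItemset`, and the `describe` helper are hypothetical stand-ins that only mirror the member names mentioned in the message (`items`, `freq`, `javaItems`). It shows why a typed wrapper spares Java callers the casts that a `Tuple2<Object, Object>` would require.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.StringJoiner;

public class FreqItemsetSketch {

    /** Hypothetical stand-in for a wrapper holding an itemset and its frequency. */
    public static final class FreqItemset<Item> {
        private final Item[] items;
        private final long freq;

        public FreqItemset(Item[] items, long freq) {
            this.items = items.clone(); // defensive copy of the caller's array
            this.freq = freq;
        }

        /** The raw array, analogous to the Scala-facing accessor. */
        public Item[] items() {
            return items.clone();
        }

        public long freq() {
            return freq;
        }

        /** A typed, read-only list view, so Java callers need no casts. */
        public List<Item> javaItems() {
            return Collections.unmodifiableList(Arrays.asList(items));
        }
    }

    /** Formats an itemset as "[a,b], 3", the same layout the doc examples print. */
    public static <Item> String describe(FreqItemset<Item> itemset) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (Item item : itemset.javaItems()) {
            joiner.add(String.valueOf(item));
        }
        return joiner + ", " + itemset.freq();
    }

    public static void main(String[] args) {
        FreqItemset<String> itemset = new FreqItemset<>(new String[] {"a", "b"}, 3L);
        System.out.println(describe(itemset)); // prints "[a,b], 3"
    }
}
```

With a wrapper like this, the element type and the count are both visible to the Java compiler, whereas a tuple of erased `Object`s forces the caller to cast each field by hand.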
author: Xiangrui Meng <meng@databricks.com> 2015-02-19 18:06:16 -0800
committer: Xiangrui Meng <meng@databricks.com> 2015-02-19 18:06:16 -0800
commit: 0cfd2cebde0b7fac3779eda80d6e42223f8a3d9f (patch)
tree: 36bdfdec69a205b85f7b85697c36abf2044d9ff5 /docs/mllib-frequent-pattern-mining.md
parent: 6bddc40353057a562c78e75c5549c79a0d7d5f8b (diff)
1 files changed, 5 insertions, 7 deletions
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index 0ff9738768..9fd9be0dd0 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -57,8 +57,8 @@ val fpg = new FPGrowth()
   .setNumPartitions(10)
 val model = fpg.run(transactions)
 
-model.freqItemsets.collect().foreach { case (itemset, freq) =>
-  println(itemset.mkString("[", ",", "]") + ", " + freq)
+model.freqItemsets.collect().foreach { itemset =>
+  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
 {% endhighlight %}
 
@@ -74,10 +74,9 @@ Calling `FPGrowth.run` with transactions returns an
 that stores the frequent itemsets with their frequencies.
 
 {% highlight java %}
-import java.util.Arrays;
 import java.util.List;
 
-import scala.Tuple2;
+import com.google.common.base.Joiner;
 
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.fpm.FPGrowth;
@@ -88,11 +87,10 @@ JavaRDD<List<String>> transactions = ...
 FPGrowth fpg = new FPGrowth()
   .setMinSupport(0.2)
   .setNumPartitions(10);
-
 FPGrowthModel<String> model = fpg.run(transactions);
 
-for (Tuple2<Object, Long> s: model.javaFreqItemsets().collect()) {
-   System.out.println("(" + Arrays.toString((Object[]) s._1()) + "): " + s._2());
+for (FPGrowth.FreqItemset<String> itemset: model.freqItemsets().toJavaRDD().collect()) {
+   System.out.println("[" + Joiner.on(",").join(itemset.javaItems()) + "], " + itemset.freq());
 }
 {% endhighlight %}