author    Xiangrui Meng <meng@databricks.com>    2014-08-19 16:06:48 -0700
committer Xiangrui Meng <meng@databricks.com>    2014-08-19 16:06:48 -0700
commit    825d4fe47b9c4d48de88622dd48dcf83beb8b80a (patch)
tree      d51775e9f88bff51458e57a5ec16de6e0b93b91a /docs/mllib-stats.md
parent    d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb (diff)
[SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs
Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #2041 from mengxr/stat-doc and squashes the following commits:

fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
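
As a rough Scala sketch of the change described above (an illustration, not part of the patch; it assumes the defaulted parameters are named `numPartitions` and `seed`), the factory methods can now be called with or without the trailing arguments, while the Java-friendly counterparts such as `normalJavaRDD` return `JavaDoubleRDD` as shown in the documentation diff below:

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs

val sc: SparkContext = ...

// Default arguments (assumed here to cover numPartitions and seed) let Scala callers
// omit the trailing parameters entirely.
val withDefaults = RandomRDDs.normalRDD(sc, 1000000L)
// The fully specified call is still available.
val explicit = RandomRDDs.normalRDD(sc, 1000000L, numPartitions = 10, seed = 42L)
{% endhighlight %}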
Diffstat (limited to 'docs/mllib-stats.md')
-rw-r--r--  docs/mllib-stats.md | 74
1 file changed, 73 insertions(+), 1 deletion(-)
diff --git a/docs/mllib-stats.md b/docs/mllib-stats.md
index ca9ef46c15..f25dca746b 100644
--- a/docs/mllib-stats.md
+++ b/docs/mllib-stats.md
@@ -25,7 +25,79 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
\newcommand{\zero}{\mathbf{0}}
\]`
-## Data Generators
+## Random data generation
+
+Random data generation is useful for randomized algorithms, prototyping, and performance testing.
+MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
+uniform, standard normal, or Poisson.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.random.RandomRDDs._
+
+val sc: SparkContext = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+val u = normalRDD(sc, 1000000L, 10)
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+val v = u.map(x => 1.0 + 2.0 * x)
+{% endhighlight %}
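+
+`RandomRDDs` also exposes vector-RDD factories. The following is a minimal sketch (assuming a
+`normalVectorRDD(sc, numRows, numCols, numPartitions)` factory analogous to `normalRDD` above)
+that generates an `RDD[Vector]` with i.i.d. entries drawn from `N(0, 1)`:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.random.RandomRDDs._
+import org.apache.spark.rdd.RDD
+
+// Generate 10000 random vectors of length 5 with i.i.d. entries drawn from `N(0, 1)`,
+// evenly distributed in 10 partitions. (Sketch: the method name and argument order are
+// assumed by analogy with the double-RDD factories.)
+val m: RDD[Vector] = normalVectorRDD(sc, numRows = 10000L, numCols = 5, numPartitions = 10)
+{% endhighlight %}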
+</div>
+
+<div data-lang="java" markdown="1">
+[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaDoubleRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.DoubleFunction;
+import static org.apache.spark.mllib.random.RandomRDDs.*;
+
+JavaSparkContext jsc = ...
+
+// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
+// Apply a transform to get a random double RDD following `N(1, 4)`.
+JavaDoubleRDD v = u.mapToDouble(
+  new DoubleFunction<Double>() {
+    public double call(Double x) {
+      return 1.0 + 2.0 * x;
+    }
+  });
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
+methods to generate random double RDDs or vector RDDs.
+The following example generates a random double RDD whose values follow the standard normal
+distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
+
+{% highlight python %}
+from pyspark.mllib.random import RandomRDDs
+
+sc = ... # SparkContext
+
+# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
+# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
+u = RandomRDDs.normalRDD(sc, 1000000L, 10)
+# Apply a transform to get a random double RDD following `N(1, 4)`.
+v = u.map(lambda x: 1.0 + 2.0 * x)
+{% endhighlight %}
+</div>
+
+</div>
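+
+Poisson data can be generated the same way. A minimal Scala sketch (assuming a
+`poissonRDD(sc, mean, size, numPartitions)` factory in the same `RandomRDDs` object):
+
+{% highlight scala %}
+import org.apache.spark.mllib.random.RandomRDDs._
+
+// Generate 1 million i.i.d. values drawn from a Poisson distribution with mean 2.0,
+// evenly distributed in 10 partitions. (Sketch: the parameter order is assumed by
+// analogy with the factories shown above; `sc` is the SparkContext from the examples.)
+val p = poissonRDD(sc, mean = 2.0, size = 1000000L, numPartitions = 10)
+{% endhighlight %}
+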
## Stratified Sampling