path: root/docs/mllib-statistics.md
author    Xin Ren <iamshrek@126.com>  2016-03-24 09:34:54 +0000
committer Sean Owen <sowen@cloudera.com>  2016-03-24 09:34:54 +0000
commit    dd9ca7b9607cb4ade287b646905d92064ac94d6f (patch)
tree      07463b657cf83cf714b59076f4ef5e18d6a589be /docs/mllib-statistics.md
parent    048a7594e2bfd2a3e531ecfa8ebbcc2032c1dac2 (diff)
[SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in mllib-statistics.md using include_example
## What changes were proposed in this pull request?

This PR for ticket SPARK-13019 is based on the previous PR (https://github.com/apache/spark/pull/11108). Since that PR broke the scala-2.10 build, more work was needed to fix the build errors. What is new in this PR is the use of keyword arguments for `fractions`:

`val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)`
`val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)`

I reopened the ticket on JIRA, but since I don't know how to reopen a GitHub pull request, I am submitting a new one.

## How was this patch tested?

Manual build testing on a local machine, building against scala-2.10.

Author: Xin Ren <iamshrek@126.com>

Closes #11901 from keypointt/SPARK-13019.
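For context, below is a minimal, self-contained sketch of the stratified-sampling call that the fixed example uses with keyword arguments. The SparkContext setup, the toy `data` pairs, and the `fractions` map are illustrative assumptions for this sketch, not taken from the patched example file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StratifiedSamplingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StratifiedSamplingSketch").setMaster("local[*]"))

    // Toy key-value data and per-key sampling fractions (illustrative only).
    val data = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
    val fractions = Map("a" -> 0.5, "b" -> 1.0)

    // Keyword arguments, as added by this patch, make each parameter explicit.
    val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
    val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)

    approxSample.collect().foreach(println)
    exactSample.collect().foreach(println)

    sc.stop()
  }
}
```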
Diffstat (limited to 'docs/mllib-statistics.md')
-rw-r--r--  docs/mllib-statistics.md | 438
1 file changed, 56 insertions, 382 deletions
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index b773031bc7..02b81f153b 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -10,24 +10,24 @@ displayTitle: Basic Statistics - spark.mllib
`\[
\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
+\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`
-## Summary statistics
+## Summary statistics
-We provide column summary statistics for `RDD[Vector]` through the function `colStats`
+We provide column summary statistics for `RDD[Vector]` through the function `colStats`
available in `Statistics`.
<div class="codetabs">
@@ -40,19 +40,7 @@ total count.
Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary) for details on the API.
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Vector
-import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
-
-val observations: RDD[Vector] = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
-println(summary.mean) // a dense vector containing the mean value for each column
-println(summary.variance) // column-wise variance
-println(summary.numNonzeros) // number of nonzeros in each column
-
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
</div>
<div data-lang="java" markdown="1">
@@ -64,24 +52,7 @@ total count.
Refer to the [`MultivariateStatisticalSummary` Java docs](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.Vector;
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaRDD<Vector> mat = ... // an RDD of Vectors
-
-// Compute column summary statistics.
-MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
-System.out.println(summary.mean()); // a dense vector containing the mean value for each column
-System.out.println(summary.variance()); // column-wise variance
-System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaSummaryStatisticsExample.java %}
</div>
<div data-lang="python" markdown="1">
@@ -92,20 +63,7 @@ total count.
Refer to the [`MultivariateStatisticalSummary` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary) for more details on the API.
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-mat = ... # an RDD of Vectors
-
-# Compute column summary statistics.
-summary = Statistics.colStats(mat)
-print(summary.mean())
-print(summary.variance())
-print(summary.numNonzeros())
-
-{% endhighlight %}
+{% include_example python/mllib/summary_statistics_example.py %}
</div>
</div>
@@ -113,96 +71,38 @@ print(summary.numNonzeros())
## Correlations
Calculating the correlation between two series of data is a common operation in Statistics. In `spark.mllib`
-we provide the flexibility to calculate pairwise correlations among many series. The supported
+we provide the flexibility to calculate pairwise correlations among many series. The supported
correlation methods are currently Pearson's and Spearman's correlation.
-
+
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.stat.Statistics
-
-val sc: SparkContext = ...
-
-val seriesX: RDD[Double] = ... // a series
-val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
-
-val data: RDD[Vector] = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-val correlMatrix: Matrix = Statistics.corr(data, "pearson")
-
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
</div>
<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or
a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively.
Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.
-{% highlight java %}
-import org.apache.spark.api.java.JavaDoubleRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.stat.Statistics;
-
-JavaSparkContext jsc = ...
-
-JavaDoubleRDD seriesX = ... // a series
-JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX
-
-// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-// method is not specified, Pearson's method will be used by default.
-Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
-
-JavaRDD<Vector> data = ... // note that each Vector is a row and not a column
-
-// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-// If a method is not specified, Pearson's method will be used by default.
-Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
-
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaCorrelationsExample.java %}
</div>
<div data-lang="python" markdown="1">
-[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
-calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
+[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to
+calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-sc = ... # SparkContext
-
-seriesX = ... # a series
-seriesY = ... # must have the same number of partitions and cardinality as seriesX
-
-# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
-# method is not specified, Pearson's method will be used by default.
-print(Statistics.corr(seriesX, seriesY, method="pearson"))
-
-data = ... # an RDD of Vectors
-# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
-# If a method is not specified, Pearson's method will be used by default.
-print(Statistics.corr(data, method="pearson"))
-
-{% endhighlight %}
+{% include_example python/mllib/correlations_example.py %}
</div>
</div>
@@ -211,187 +111,76 @@ print(Statistics.corr(data, method="pearson"))
Unlike the other statistics functions, which reside in `spark.mllib`, stratified sampling methods,
`sampleByKey` and `sampleByKeyExact`, can be performed on RDD's of key-value pairs. For stratified
-sampling, the keys can be thought of as a label and the value as a specific attribute. For example
-the key can be man or woman, or document ids, and the respective values can be the list of ages
-of the people in the population or the list of words in the documents. The `sampleByKey` method
-will flip a coin to decide whether an observation will be sampled or not, therefore requires one
-pass over the data, and provides an *expected* sample size. `sampleByKeyExact` requires significant
+sampling, the keys can be thought of as a label and the value as a specific attribute. For example
+the key can be man or woman, or document ids, and the respective values can be the list of ages
+of the people in the population or the list of words in the documents. The `sampleByKey` method
+will flip a coin to decide whether an observation will be sampled or not, therefore requires one
+pass over the data, and provides an *expected* sample size. `sampleByKeyExact` requires significant
more resources than the per-stratum simple random sampling used in `sampleByKey`, but will provide
-the exact sampling size with 99.99% confidence. `sampleByKeyExact` is currently not supported in
+the exact sampling size with 99.99% confidence. `sampleByKeyExact` is currently not supported in
python.
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
+keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.SparkContext._
-import org.apache.spark.rdd.PairRDDFunctions
-
-val sc: SparkContext = ...
-
-val data = ... // an RDD[(K, V)] of any key value pairs
-val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
-
-// Get an exact sample from each stratum
-val approxSample = data.sampleByKey(withReplacement = false, fractions)
-val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
-
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/StratifiedSamplingExample.scala %}
</div>
<div data-lang="java" markdown="1">
[`sampleByKeyExact()`](api/java/org/apache/spark/api/java/JavaPairRDD.html) allows users to
-sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
-keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
+keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
size, whereas sampling with replacement requires two additional passes.
-{% highlight java %}
-import java.util.Map;
-
-import org.apache.spark.api.java.JavaPairRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-
-JavaSparkContext jsc = ...
-
-JavaPairRDD<K, V> data = ... // an RDD of any key value pairs
-Map<K, Object> fractions = ... // specify the exact fraction desired from each key
-
-// Get an exact sample from each stratum
-JavaPairRDD<K, V> approxSample = data.sampleByKey(false, fractions);
-JavaPairRDD<K, V> exactSample = data.sampleByKeyExact(false, fractions);
-
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaStratifiedSamplingExample.java %}
</div>
<div data-lang="python" markdown="1">
[`sampleByKey()`](api/python/pyspark.html#pyspark.RDD.sampleByKey) allows users to
-sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
-desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
+sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the
+desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the
set of keys.
*Note:* `sampleByKeyExact()` is currently not supported in Python.
-{% highlight python %}
-
-sc = ... # SparkContext
-
-data = ... # an RDD of any key value pairs
-fractions = ... # specify the exact fraction desired from each key as a dictionary
-
-approxSample = data.sampleByKey(False, fractions);
-
-{% endhighlight %}
+{% include_example python/mllib/stratified_sampling_example.py %}
</div>
</div>
## Hypothesis testing
-Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
-significant, whether this result occurred by chance or not. `spark.mllib` currently supports Pearson's
+Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
+significant, whether this result occurred by chance or not. `spark.mllib` currently supports Pearson's
chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
-whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
+whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
-`spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
+`spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
independence tests.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
+[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.
-{% highlight scala %}
-import org.apache.spark.SparkContext
-import org.apache.spark.mllib.linalg._
-import org.apache.spark.mllib.regression.LabeledPoint
-import org.apache.spark.mllib.stat.Statistics._
-
-val sc: SparkContext = ...
-
-val vec: Vector = ... // a vector composed of the frequencies of events
-
-// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
-println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
- // test statistic, the method used, and the null hypothesis.
-
-val mat: Matrix = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-val independenceTestResult = Statistics.chiSqTest(mat)
-println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
-
-val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
-// against the label.
-val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
-var i = 1
-featureTestResults.foreach { result =>
- println(s"Column $i:\n$result")
- i += 1
-} // summary of the test
-
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingExample.scala %}
</div>
<div data-lang="java" markdown="1">
-[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
-run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
+[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to
+run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests.
Refer to the [`ChiSqTestResult` Java docs](api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API.
-{% highlight java %}
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.mllib.linalg.*;
-import org.apache.spark.mllib.regression.LabeledPoint;
-import org.apache.spark.mllib.stat.Statistics;
-import org.apache.spark.mllib.stat.test.ChiSqTestResult;
-
-JavaSparkContext jsc = ...
-
-Vector vec = ... // a vector composed of the frequencies of events
-
-// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-// the test runs against a uniform distribution.
-ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
-// summary of the test including the p-value, degrees of freedom, test statistic, the method used,
-// and the null hypothesis.
-System.out.println(goodnessOfFitTestResult);
-
-Matrix mat = ... // a contingency matrix
-
-// conduct Pearson's independence test on the input contingency matrix
-ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
-// summary of the test including the p-value, degrees of freedom...
-System.out.println(independenceTestResult);
-
-JavaRDD<LabeledPoint> obs = ... // an RDD of labeled points
-
-// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
-// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
-// against the label.
-ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
-int i = 1;
-for (ChiSqTestResult result : featureTestResults) {
- System.out.println("Column " + i + ":");
- System.out.println(result); // summary of the test
- i++;
-}
-
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingExample.java %}
</div>
<div data-lang="python" markdown="1">
@@ -401,50 +190,18 @@ hypothesis tests.
Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
-{% highlight python %}
-from pyspark import SparkContext
-from pyspark.mllib.linalg import Vectors, Matrices
-from pyspark.mllib.regresssion import LabeledPoint
-from pyspark.mllib.stat import Statistics
-
-sc = SparkContext()
-
-vec = Vectors.dense(...) # a vector composed of the frequencies of events
-
-# compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
-# the test runs against a uniform distribution.
-goodnessOfFitTestResult = Statistics.chiSqTest(vec)
-print(goodnessOfFitTestResult) # summary of the test including the p-value, degrees of freedom,
- # test statistic, the method used, and the null hypothesis.
-
-mat = Matrices.dense(...) # a contingency matrix
-
-# conduct Pearson's independence test on the input contingency matrix
-independenceTestResult = Statistics.chiSqTest(mat)
-print(independenceTestResult) # summary of the test including the p-value, degrees of freedom...
-
-obs = sc.parallelize(...) # LabeledPoint(feature, label) .
-
-# The contingency table is constructed from an RDD of LabeledPoint and used to conduct
-# the independence test. Returns an array containing the ChiSquaredTestResult for every feature
-# against the label.
-featureTestResults = Statistics.chiSqTest(obs)
-
-for i, result in enumerate(featureTestResults):
- print("Column $d:" % (i + 1))
- print(result)
-{% endhighlight %}
+{% include_example python/mllib/hypothesis_testing_example.py %}
</div>
</div>
Additionally, `spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
for equality of probability distributions. By providing the name of a theoretical distribution
-(currently solely supported for the normal distribution) and its parameters, or a function to
+(currently solely supported for the normal distribution) and its parameters, or a function to
calculate the cumulative distribution according to a given theoretical distribution, the user can
test the null hypothesis that their sample is drawn from that distribution. In the case that the
user tests against the normal distribution (`distName="norm"`), but does not provide distribution
-parameters, the test initializes to the standard normal distribution and logs an appropriate
+parameters, the test initializes to the standard normal distribution and logs an appropriate
message.
<div class="codetabs">
@@ -455,21 +212,7 @@ and interpret the hypothesis tests.
Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
-{% highlight scala %}
-import org.apache.spark.mllib.stat.Statistics
-
-val data: RDD[Double] = ... // an RDD of sample data
-
-// run a KS test for the sample versus a standard normal distribution
-val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
-println(testResult) // summary of the test including the p-value, test statistic,
- // and null hypothesis
- // if our p-value indicates significance, we can reject the null hypothesis
-
-// perform a KS test using a cumulative distribution function of our making
-val myCDF: Double => Double = ...
-val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
</div>
<div data-lang="java" markdown="1">
@@ -479,23 +222,7 @@ and interpret the hypothesis tests.
Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API.
-{% highlight java %}
-import java.util.Arrays;
-
-import org.apache.spark.api.java.JavaDoubleRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-
-import org.apache.spark.mllib.stat.Statistics;
-import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;
-
-JavaSparkContext jsc = ...
-JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...));
-KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0);
-// summary of the test including the p-value, test statistic,
-// and null hypothesis
-// if our p-value indicates significance, we can reject the null hypothesis
-System.out.println(testResult);
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaHypothesisTestingKolmogorovSmirnovTestExample.java %}
</div>
<div data-lang="python" markdown="1">
@@ -505,19 +232,7 @@ and interpret the hypothesis tests.
Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API.
-{% highlight python %}
-from pyspark.mllib.stat import Statistics
-
-parallelData = sc.parallelize([1.0, 2.0, ... ])
-
-# run a KS test for the sample versus a standard normal distribution
-testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
-print(testResult) # summary of the test including the p-value, test statistic,
- # and null hypothesis
- # if our p-value indicates significance, we can reject the null hypothesis
-# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with
-# a lambda to calculate the CDF is not made available in the Python API
-{% endhighlight %}
+{% include_example python/mllib/hypothesis_testing_kolmogorov_smirnov_test_example.py %}
</div>
</div>
@@ -651,21 +366,7 @@ to do so.
Refer to the [`KernelDensity` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) for details on the API.
-{% highlight scala %}
-import org.apache.spark.mllib.stat.KernelDensity
-import org.apache.spark.rdd.RDD
-
-val data: RDD[Double] = ... // an RDD of sample data
-
-// Construct the density estimator with the sample data and a standard deviation for the Gaussian
-// kernels
-val kd = new KernelDensity()
- .setSample(data)
- .setBandwidth(3.0)
-
-// Find density estimates for the given values
-val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
-{% endhighlight %}
+{% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %}
</div>
<div data-lang="java" markdown="1">
@@ -675,21 +376,7 @@ to do so.
Refer to the [`KernelDensity` Java docs](api/java/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
-{% highlight java %}
-import org.apache.spark.mllib.stat.KernelDensity;
-import org.apache.spark.rdd.RDD;
-
-RDD<Double> data = ... // an RDD of sample data
-
-// Construct the density estimator with the sample data and a standard deviation for the Gaussian
-// kernels
-KernelDensity kd = new KernelDensity()
- .setSample(data)
- .setBandwidth(3.0);
-
-// Find density estimates for the given values
-double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
-{% endhighlight %}
+{% include_example java/org/apache/spark/examples/mllib/JavaKernelDensityEstimationExample.java %}
</div>
<div data-lang="python" markdown="1">
@@ -699,20 +386,7 @@ to do so.
Refer to the [`KernelDensity` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) for more details on the API.
-{% highlight python %}
-from pyspark.mllib.stat import KernelDensity
-
-data = ... # an RDD of sample data
-
-# Construct the density estimator with the sample data and a standard deviation for the Gaussian
-# kernels
-kd = KernelDensity()
-kd.setSample(data)
-kd.setBandwidth(3.0)
-
-# Find density estimates for the given values
-densities = kd.estimate([-1.0, 2.0, 5.0])
-{% endhighlight %}
+{% include_example python/mllib/kernel_density_estimation_example.py %}
</div>
</div>