author    Sandy Ryza <sandy@cloudera.com>    2015-08-17 17:57:51 -0700
committer Xiangrui Meng <meng@databricks.com>    2015-08-17 17:57:51 -0700
commit    f9d1a92aa1bac4494022d78559b871149579e6e8
parent    0b6b01761370629ce387c143a25d41f3a334ff28
[SPARK-7707] User guide and example code for KernelDensity
Author: Sandy Ryza <sandy@cloudera.com>
Closes #8230 from sryza/sandy-spark-7707.
Diffstat (limited to 'docs')
 docs/mllib-statistics.md | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+), 0 deletions(-)
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index be04d0b4b5..80a9d064c0 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
 v = u.map(lambda x: 1.0 + 2.0 * x)
 {% endhighlight %}
 </div>
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
+</div>
 </div>
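The guide text added above describes the estimator as the mean of Gaussian PDFs centered on each sample. As a minimal illustration of that idea outside Spark, here is a plain-Python sketch (the function names `gaussian_pdf` and `kernel_density_estimate` are hypothetical, not part of any Spark API); it mirrors the same inputs as the examples in the diff:

```python
import math

def gaussian_pdf(x, mean, stddev):
    # PDF of a normal distribution with the given mean and standard deviation
    coeff = 1.0 / (stddev * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * stddev ** 2))

def kernel_density_estimate(samples, points, bandwidth):
    # For each evaluation point, take the mean of Gaussian PDFs
    # centered on each sample, with the bandwidth as standard deviation
    return [
        sum(gaussian_pdf(p, s, bandwidth) for s in samples) / len(samples)
        for p in points
    ]

samples = [1.0, 2.0, 2.5, 3.0, 4.0]          # toy stand-in for the RDD of samples
densities = kernel_density_estimate(samples, [-1.0, 2.0, 5.0], bandwidth=3.0)
```

The density is highest near where the samples cluster (around 2.0 here) and falls off for points farther from the data, which is the behavior `KernelDensity.estimate` computes in a distributed fashion over an RDD.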