author    Sandy Ryza <sandy@cloudera.com>    2015-08-17 17:57:51 -0700
committer Xiangrui Meng <meng@databricks.com>    2015-08-17 17:57:51 -0700
commit    f9d1a92aa1bac4494022d78559b871149579e6e8 (patch)
tree      b52a21dbf3d2bea804adc983fd564db00665687d /docs/mllib-statistics.md
parent    0b6b01761370629ce387c143a25d41f3a334ff28 (diff)
[SPARK-7707] User guide and example code for KernelDensity
Author: Sandy Ryza <sandy@cloudera.com>
Closes #8230 from sryza/sandy-spark-7707.
Diffstat (limited to 'docs/mllib-statistics.md')
-rw-r--r--    docs/mllib-statistics.md    77
1 file changed, 77 insertions(+), 0 deletions(-)
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index be04d0b4b5..80a9d064c0 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
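+
+In symbols, given i.i.d. samples `$x_1, \ldots, x_n$` and a bandwidth `$h$` (the standard
+deviation of each Gaussian kernel), the description above corresponds to the standard Gaussian
+kernel density estimate at a point `$x$`:
+
+`\[
+\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h\sqrt{2\pi}}
+  \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)
+\]`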
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+ .setSample(data)
+ .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
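+
+To run the snippet end to end, the elided `data` can be any `RDD[Double]`; for instance, a small
+sample built with `sc.parallelize` (the values below are purely illustrative):
+
+{% highlight scala %}
+// A small, made-up sample; in practice this would be real observations
+val data: RDD[Double] = sc.parallelize(Seq(1.0, 1.5, 2.0, 2.5, 5.0, 5.5))
+{% endhighlight %}
+
+`estimate` returns one density value per evaluation point, in the same order as the input array.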
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+ .setSample(data)
+ .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
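+
+Note that, unlike the Scala and Java examples above, the Python setters are invoked as separate
+statements rather than chained, since they do not return the estimator.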
+</div>
</div>