diff options
author | Martin Jaggi <m.jaggi@gmail.com> | 2014-02-08 11:39:13 -0800 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-02-08 11:39:13 -0800 |
commit | fabf1749995103841e6a3975892572f376ee48d0 (patch) | |
tree | a9c03486cce6cc4f390405f33266a31861ebe3d4 /docs/mllib-clustering.md | |
parent | 3a9d82cc9e85accb5c1577cf4718aa44c8d5038c (diff) | |
download | spark-fabf1749995103841e6a3975892572f376ee48d0.tar.gz spark-fabf1749995103841e6a3975892572f376ee48d0.tar.bz2 spark-fabf1749995103841e6a3975892572f376ee48d0.zip |
Merge pull request #552 from martinjaggi/master. Closes #552.
tex formulas in the documentation
using mathjax.
and spliting the MLlib documentation by techniques
see jira
https://spark-project.atlassian.net/browse/MLLIB-19
and
https://github.com/shivaram/spark/compare/mathjax
Author: Martin Jaggi <m.jaggi@gmail.com>
== Merge branch commits ==
commit 0364bfabbfc347f917216057a20c39b631842481
Author: Martin Jaggi <m.jaggi@gmail.com>
Date: Fri Feb 7 03:19:38 2014 +0100
minor polishing, as suggested by @pwendell
commit dcd2142c164b2f602bf472bb152ad55bae82d31a
Author: Martin Jaggi <m.jaggi@gmail.com>
Date: Thu Feb 6 18:04:26 2014 +0100
enabling inline latex formulas with $.$
same mathjax configuration as used in math.stackexchange.com
sample usage in the linear algebra (SVD) documentation
commit bbafafd2b497a5acaa03a140bb9de1fbb7d67ffa
Author: Martin Jaggi <m.jaggi@gmail.com>
Date: Thu Feb 6 17:31:29 2014 +0100
split MLlib documentation by techniques
and linked from the main mllib-guide.md site
commit d1c5212b93c67436543c2d8ddbbf610fdf0a26eb
Author: Martin Jaggi <m.jaggi@gmail.com>
Date: Thu Feb 6 16:59:43 2014 +0100
enable mathjax formula in the .md documentation files
code by @shivaram
commit d73948db0d9bc36296054e79fec5b1a657b4eab4
Author: Martin Jaggi <m.jaggi@gmail.com>
Date: Thu Feb 6 16:57:23 2014 +0100
minor update on how to compile the documentation
Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r-- | docs/mllib-clustering.md | 106 |
1 files changed, 106 insertions, 0 deletions
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md new file mode 100644 index 0000000000..65ed75b82e --- /dev/null +++ b/docs/mllib-clustering.md @@ -0,0 +1,106 @@ +--- +layout: global +title: MLlib - Clustering +--- + +* Table of contents +{:toc} + + +# Clustering + +Clustering is an unsupervised learning problem whereby we aim to group subsets +of entities with one another based on some notion of similarity. Clustering is +often used for exploratory analysis and/or as a component of a hierarchical +supervised learning pipeline (in which distinct classifiers or regression +models are trained for each cluster). MLlib supports +[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of +the most commonly used clustering algorithms that clusters the data points into +predfined number of clusters. The MLlib implementation includes a parallelized +variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method +called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). +The implementation in MLlib has the following parameters: + +* *k* is the number of desired clusters. +* *maxIterations* is the maximum number of iterations to run. +* *initializationMode* specifies either random initialization or +initialization via k-means\|\|. +* *runs* is the number of times to run the k-means algorithm (k-means is not +guaranteed to find a globally optimal solution, and when run multiple times on +a given dataset, the algorithm returns the best clustering result). +* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm. +* *epsilon* determines the distance threshold within which we consider k-means to have converged. + +Available algorithms for clustering: + +* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans) + + + +# Usage in Scala + +Following code snippets can be executed in `spark-shell`. + +In the following example after loading and parsing data, we use the KMeans object to cluster the data +into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within +Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the +optimal *k* is usually one where there is an "elbow" in the WSSSE graph. + +{% highlight scala %} +import org.apache.spark.mllib.clustering.KMeans + +// Load and parse the data +val data = sc.textFile("kmeans_data.txt") +val parsedData = data.map( _.split(' ').map(_.toDouble)) + +// Cluster the data into two classes using KMeans +val numIterations = 20 +val numClusters = 2 +val clusters = KMeans.train(parsedData, numClusters, numIterations) + +// Evaluate clustering by computing Within Set Sum of Squared Errors +val WSSSE = clusters.computeCost(parsedData) +println("Within Set Sum of Squared Errors = " + WSSSE) +{% endhighlight %} + + +# Usage in Java + +All of MLlib's methods use Java-friendly types, so you can import and call them there the same +way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the +Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by +calling `.rdd()` on your `JavaRDD` object. + +# Usage in Python +Following examples can be tested in the PySpark shell. + +In the following example after loading and parsing data, we use the KMeans object to cluster the data +into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within +Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the +optimal *k* is usually one where there is an "elbow" in the WSSSE graph. + +{% highlight python %} +from pyspark.mllib.clustering import KMeans +from numpy import array +from math import sqrt + +# Load and parse the data +data = sc.textFile("kmeans_data.txt") +parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])) + +# Build the model (cluster the data) +clusters = KMeans.train(parsedData, 2, maxIterations=10, + runs=30, initialization_mode="random") + +# Evaluate clustering by computing Within Set Sum of Squared Errors +def error(point): + center = clusters.centers[clusters.predict(point)] + return sqrt(sum([x**2 for x in (point - center)])) + +WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y) +print("Within Set Sum of Squared Error = " + str(WSSSE)) +{% endhighlight %} + +Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training Mean Squared +Errors. + |