author    Timothy Hunter <timhunter@databricks.com>    2015-12-10 12:50:46 -0800
committer Joseph K. Bradley <joseph@databricks.com>    2015-12-10 12:50:46 -0800
commit    2ecbe02d5b28ee562d10c1735244b90a08532c9e (patch)
tree      c589a01a2900513aa1b277303ed7cdffc1961ba4 /docs/mllib-statistics.md
parent    ec5f9ed5de2218938dba52152475daafd4dc4786 (diff)
[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark). It also removes some files that I forgot to delete with #10207.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #10234 from thunterdb/12212.
Diffstat (limited to 'docs/mllib-statistics.md')
-rw-r--r--  docs/mllib-statistics.md  |  18
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index de209f68e1..652d215fa8 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Basic Statistics - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics
+title: Basic Statistics - spark.mllib
+displayTitle: Basic Statistics - spark.mllib
---
* Table of contents
@@ -112,7 +112,7 @@ print(summary.numNonzeros())
## Correlations
-Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
+Calculating the correlation between two series of data is a common operation in Statistics. In `spark.mllib`
we provide the flexibility to calculate pairwise correlations among many series. The supported
correlation methods are currently Pearson's and Spearman's correlation.
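
As a quick illustration of the correlation API described above, here is a minimal Scala sketch. It assumes an existing `SparkContext` named `sc`; the input values are illustrative only.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Correlation between two series of equal length (Pearson in this case)
val seriesX = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0))
val seriesY = sc.parallelize(Array(11.0, 22.0, 33.0, 33.0, 555.0))
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

// Pairwise correlations among the columns of an RDD[Vector]
val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(5.0, 33.0, 366.0)))
val correlMatrix = Statistics.corr(data, "spearman")
```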
@@ -209,7 +209,7 @@ print(Statistics.corr(data, method="pearson"))
## Stratified sampling
-Unlike the other statistics functions, which reside in MLlib, stratified sampling methods,
+Unlike the other statistics functions, which reside in `spark.mllib`, stratified sampling methods,
`sampleByKey` and `sampleByKeyExact`, can be performed on RDD's of key-value pairs. For stratified
sampling, the keys can be thought of as a label and the value as a specific attribute. For example
the key can be man or woman, or document ids, and the respective values can be the list of ages
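
A minimal Scala sketch of the two methods follows, assuming an existing `SparkContext` `sc`; the keys and sampling fractions are only illustrative.

```scala
// An RDD of key-value pairs; the key plays the role of the stratum label
val data = sc.parallelize(Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))

// Desired sampling fraction for each key
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// Approximate sample: one pass, sample size close to fraction * count per key
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)

// Exact sample: extra passes, but exactly ceil(fraction * count) items per key
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)
```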
@@ -294,12 +294,12 @@ approxSample = data.sampleByKey(False, fractions);
## Hypothesis testing
Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically
-significant, whether this result occurred by chance or not. MLlib currently supports Pearson's
+significant, whether this result occurred by chance or not. `spark.mllib` currently supports Pearson's
chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
whether the goodness of fit or the independence test is conducted. The goodness of fit test requires
an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
-MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
+`spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared
independence tests.
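
To make the three input types concrete, here is a minimal Scala sketch (assuming an existing `SparkContext` `sc`; the data below are illustrative only).

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Goodness of fit: a Vector of observed frequencies, tested against a
// uniform distribution when no expected Vector is supplied
val observed = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
val goodnessOfFitResult = Statistics.chiSqTest(observed)

// Independence: a contingency Matrix of counts
val counts = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
val independenceResult = Statistics.chiSqTest(counts)

// Feature selection: one independence test per feature against the label
val labeled = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 0.0))))
val featureTestResults = Statistics.chiSqTest(labeled) // one result per feature
```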
<div class="codetabs">
@@ -438,7 +438,7 @@ for i, result in enumerate(featureTestResults):
</div>
-Additionally, MLlib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
+Additionally, `spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
for equality of probability distributions. By providing the name of a theoretical distribution
(currently solely supported for the normal distribution) and its parameters, or a function to
calculate the cumulative distribution according to a given theoretical distribution, the user can
@@ -522,7 +522,7 @@ print(testResult) # summary of the test including the p-value, test statistic,
</div>
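
As a pointer back to the KS test just described, a minimal Scala call looks like the following (assuming an existing `SparkContext` `sc`; the sample values are illustrative).

```scala
import org.apache.spark.mllib.stat.Statistics

val sample = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))

// 1-sample, 2-sided KS test against the standard normal distribution:
// the distribution name is followed by its parameters (mean, stddev)
val testResult = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)
println(testResult) // p-value, test statistic, and null hypothesis
```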
### Streaming Significance Testing
-MLlib provides online implementations of some tests to support use cases
+`spark.mllib` provides online implementations of some tests to support use cases
like A/B testing. These tests may be performed on a Spark Streaming
`DStream[(Boolean,Double)]` where the first element of each tuple
indicates control group (`false`) or treatment group (`true`) and the
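
A minimal Scala sketch of such a streaming test follows, assuming an existing `StreamingContext` `ssc` and an input directory of comma-separated `label,value` records (both are placeholders).

```scala
import org.apache.spark.mllib.stat.test.StreamingTest

// Parse each record into (isTreatmentGroup, observedValue)
val data = ssc.textFileStream("path/to/data").map { line =>
  val Array(label, value) = line.split(",")
  (label.toBoolean, value.toDouble)
}

val streamingTest = new StreamingTest()
  .setPeacePeriod(0)       // initial batches to ignore
  .setWindowSize(0)        // 0 = cumulative test over all batches seen so far
  .setTestMethod("welch")  // Welch's t-test; "student" is also supported

val out = streamingTest.registerStream(data)
out.print()
```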
@@ -550,7 +550,7 @@ provides streaming hypothesis testing.
## Random data generation
Random data generation is useful for randomized algorithms, prototyping, and performance testing.
-MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
+`spark.mllib` supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.
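
For example, a minimal Scala sketch generating a normally distributed random RDD (assuming an existing `SparkContext` `sc`):

```scala
import org.apache.spark.mllib.random.RandomRDDs._

// One million i.i.d. samples from the standard normal N(0, 1), in 10 partitions
val u = normalRDD(sc, 1000000L, 10)

// Shift and scale to obtain samples from N(1, 4), i.e. mean 1 and stddev 2
val v = u.map(x => 1.0 + 2.0 * x)
```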
<div class="codetabs">