From 25ad8f93012730115a8a1fac649fe3e842c045b3 Mon Sep 17 00:00:00 2001
From: Sean Owen
Date: Tue, 6 May 2014 20:07:22 -0700
Subject: SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs

While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.

Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
---
 docs/mllib-decision-tree.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'docs/mllib-decision-tree.md')

diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 0693766990..296277e58b 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -83,19 +83,19 @@ Section 9.2.4 in
 [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
 details). For example, for a binary classification problem with one categorical feature with three
 categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
 and A , B \| C where \| denotes the split.
 
 ### Stopping rule
 
 The recursive tree construction is stopped at a node when one of the two conditions is met:
 
-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
 2. No split candidate leads to an information gain at the node.
 
 ### Practical limitations
 
-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
 in memory for aggregating histograms over partitions. The current implementation might not scale
 to very deep trees since the memory requirement grows exponentially with tree depth.
 2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
 
--
cgit v1.2.3