author    Dongjoon Hyun <dongjoon@apache.org>    2016-02-22 09:52:07 +0000
committer Sean Owen <sowen@cloudera.com>    2016-02-22 09:52:07 +0000
commit    024482bf51e8158eed08a7dc0758f585baf86e1f (patch)
tree      e51f2c53b027178bb4e485d2781e266d96ff6e3d /docs
parent    1b144455b620861d8cc790d3fc69902717f14524 (diff)
[MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under the `docs` module, and fixes similar typos in other comments, too.

## How was this patch tested?

Manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
Diffstat (limited to 'docs')
-rw-r--r--  docs/ml-classification-regression.md  | 2
-rw-r--r--  docs/ml-features.md                    | 6
-rw-r--r--  docs/ml-guide.md                       | 2
-rw-r--r--  docs/mllib-clustering.md               | 6
-rw-r--r--  docs/mllib-evaluation-metrics.md       | 6
-rw-r--r--  docs/mllib-frequent-pattern-mining.md  | 2
-rw-r--r--  docs/monitoring.md                     | 2
-rw-r--r--  docs/programming-guide.md              | 2
-rw-r--r--  docs/running-on-mesos.md               | 4
-rw-r--r--  docs/spark-standalone.md               | 2
-rw-r--r--  docs/sql-programming-guide.md          | 2
-rw-r--r--  docs/streaming-flume-integration.md    | 2
-rw-r--r--  docs/streaming-kinesis-integration.md  | 4
-rw-r--r--  docs/streaming-programming-guide.md    | 2
14 files changed, 22 insertions, 22 deletions
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 9569a06472..45155c8ad1 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -252,7 +252,7 @@ Nodes in the output layer use softmax function:
\]`
The number of nodes `$N$` in the output layer corresponds to the number of classes.
-MLPC employes backpropagation for learning the model. We use logistic loss function for optimization and L-BFGS as optimization routine.
+MLPC employs backpropagation for learning the model. We use logistic loss function for optimization and L-BFGS as optimization routine.
**Example**
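For context, a minimal Scala sketch of fitting an MLPC model as described above (the layer sizes and the `train` DataFrame with "features"/"label" columns are illustrative assumptions, not part of this commit):

{% highlight scala %}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Layers: 4 input features, two hidden layers (5 and 4 nodes), 3 output classes.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// `train` is assumed to be a DataFrame with "features" and "label" columns.
val model = trainer.fit(train)
{% endhighlight %}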
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 5809f65d63..68d3ea2971 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -185,7 +185,7 @@ for more details on the API.
<div data-lang="python" markdown="1">
Refer to the [Tokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer) and
-the the [RegexTokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer)
+the [RegexTokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer)
for more details on the API.
{% include_example python/ml/tokenizer_example.py %}
@@ -459,7 +459,7 @@ column, we should get the following:
"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
index `2`.
-Additionaly, there are two strategies regarding how `StringIndexer` will handle
+Additionally, there are two strategies regarding how `StringIndexer` will handle
unseen labels when you have fit a `StringIndexer` on one dataset and then use it
to transform another:
@@ -779,7 +779,7 @@ for more details on the API.
* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; Otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
-Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potenial out of Bucketizer bounds exception.
+Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
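A minimal Scala sketch of the guidance above (the column names and split points are illustrative):

{% highlight scala %}
import org.apache.spark.ml.feature.Bucketizer

// Infinite outer bounds guard against values outside the known range.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// `dataFrame` is assumed to have a numeric "features" column.
val bucketed = bucketizer.transform(dataFrame)
{% endhighlight %}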
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 1770aabf6f..8eee2fb674 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -628,7 +628,7 @@ Currently, `spark.ml` supports model selection using the [`CrossValidator`](api/
The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
for binary data, or a [`MultiClassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
-for multiclass problems. The default metric used to choose the best `ParamMap` can be overriden by the `setMetricName`
+for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
method in each of these evaluators.
The `ParamMap` which produces the best evaluation metric (averaged over the `$k$` folds) is selected as the best model.
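For instance, a hedged sketch of overriding the metric for a regression problem (`pipeline` and `paramGrid` are assumed to be defined elsewhere):

{% highlight scala %}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Choose mean absolute error instead of the evaluator's default metric.
val evaluator = new RegressionEvaluator().setMetricName("mae")

val cv = new CrossValidator()
  .setEstimator(pipeline)          // an Estimator, e.g. a Pipeline
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
{% endhighlight %}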
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index d0be032868..8e724fbf06 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -300,7 +300,7 @@ for i in range(2):
## Power iteration clustering (PIC)
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
-graph given pairwise similarties as edge properties,
+graph given pairwise similarities as edge properties,
described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
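A minimal Scala sketch of running PIC on a similarity graph (the `similarities` RDD and the parameter values are illustrative):

{% highlight scala %}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// `similarities` is assumed to be an RDD[(Long, Long, Double)] of
// (srcId, dstId, similarity) edges with nonnegative similarities.
val pic = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(10)
val model = pic.run(similarities)

model.assignments.foreach { a =>
  println(s"${a.id} -> ${a.cluster}")
}
{% endhighlight %}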
@@ -786,7 +786,7 @@ This example shows how to estimate clusters on streaming data.
<div data-lang="scala" markdown="1">
Refer to the [`StreamingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) for details on the API.
-First we import the neccessary classes.
+First we import the necessary classes.
{% highlight scala %}
@@ -837,7 +837,7 @@ ssc.awaitTermination()
<div data-lang="python" markdown="1">
Refer to the [`StreamingKMeans` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.StreamingKMeans) for more details on the API.
-First we import the neccessary classes.
+First we import the necessary classes.
{% highlight python %}
from pyspark.mllib.linalg import Vectors
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index 774826c270..a269dbf030 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -67,7 +67,7 @@ plots (recall, false positive rate) points.
</thead>
<tbody>
<tr>
- <td>Precision (Postive Predictive Value)</td>
+ <td>Precision (Positive Predictive Value)</td>
<td>$PPV=\frac{TP}{TP + FP}$</td>
</tr>
<tr>
@@ -360,7 +360,7 @@ $$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{ca
**Examples**
-The following code snippets illustrate how to evaluate the performance of a multilabel classifer. The examples
+The following code snippets illustrate how to evaluate the performance of a multilabel classifier. The examples
use the fake prediction and label data for multilabel classification that is shown below.
Document predictions:
@@ -558,7 +558,7 @@ variable from a number of independent variables.
<td>$RMSE = \sqrt{\frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}}$</td>
</tr>
<tr>
- <td>Mean Absoloute Error (MAE)</td>
+ <td>Mean Absolute Error (MAE)</td>
<td>$MAE=\sum_{i=0}^{N-1} \left|\mathbf{y}_i - \hat{\mathbf{y}}_i\right|$</td>
</tr>
<tr>
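For the regression metrics in this table, a minimal Scala sketch using `RegressionMetrics` (the `predictionAndLabels` RDD is a placeholder):

{% highlight scala %}
import org.apache.spark.mllib.evaluation.RegressionMetrics

// `predictionAndLabels` is assumed to be an RDD[(Double, Double)] of
// (prediction, observation) pairs.
val metrics = new RegressionMetrics(predictionAndLabels)
println(s"MAE  = ${metrics.meanAbsoluteError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
{% endhighlight %}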
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index 2c8a8f2361..a7b55dc5e5 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -135,7 +135,7 @@ pattern mining problem.
included in the results.
* `maxLocalProjDBSize`: the maximum number of items allowed in a
prefix-projected database before local iterative processing of the
- projected databse begins. This parameter should be tuned with respect
+ projected database begins. This parameter should be tuned with respect
to the size of your executors.
**Examples**
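A minimal Scala sketch of setting the parameters above (the `sequences` RDD is a placeholder):

{% highlight scala %}
import org.apache.spark.mllib.fpm.PrefixSpan

// `sequences` is assumed to be an RDD[Array[Array[Int]]], one array of
// itemsets per sequence.
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000L) // tune relative to executor memory
val model = prefixSpan.run(sequences)
{% endhighlight %}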
diff --git a/docs/monitoring.md b/docs/monitoring.md
index c37f6fb20d..c139e1cb5a 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -108,7 +108,7 @@ The history server can be configured as follows:
<td>spark.history.fs.update.interval</td>
<td>10s</td>
<td>
- The period at which the the filesystem history provider checks for new or
+ The period at which the filesystem history provider checks for new or
updated logs in the log directory. A shorter interval detects new applications faster,
at the expense of more server load re-reading updated applications.
As soon as an update has completed, listings of the completed and incomplete applications
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2d6f7767d9..5ebafa40b0 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -629,7 +629,7 @@ class MyClass {
}
{% endhighlight %}
-is equilvalent to writing `rdd.map(x => this.field + x)`, which references all of `this`. To avoid this
+is equivalent to writing `rdd.map(x => this.field + x)`, which references all of `this`. To avoid this
issue, the simplest way is to copy `field` into a local variable instead of accessing it externally:
{% highlight scala %}
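// A minimal sketch of the fix described above: copy the field into a local
// variable so the closure no longer captures `this` (assumes an RDD[String]
// and a String-typed `field` on the enclosing class).
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}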
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index 35f6caab17..b9f64c7ed1 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -188,7 +188,7 @@ overhead, but at the cost of reserving the Mesos resources for the complete dura
application.
Coarse-grained is the default mode. You can also set `spark.mesos.coarse` property to true
-to turn it on explictly in [SparkConf](configuration.html#spark-properties):
+to turn it on explicitly in [SparkConf](configuration.html#spark-properties):
{% highlight scala %}
conf.set("spark.mesos.coarse", "true")
@@ -384,7 +384,7 @@ See the [configuration page](configuration.html) for information on Spark config
<li>Scalar constraints are matched with "less than equal" semantics i.e. value in the constraint must be less than or equal to the value in the resource offer.</li>
<li>Range constraints are matched with "contains" semantics i.e. value in the constraint must be within the resource offer's value.</li>
<li>Set constraints are matched with "subset of" semantics i.e. value in the constraint must be a subset of the resource offer's value.</li>
- <li>Text constraints are metched with "equality" semantics i.e. value in the constraint must be exactly equal to the resource offer's value.</li>
+ <li>Text constraints are matched with "equality" semantics i.e. value in the constraint must be exactly equal to the resource offer's value.</li>
<li>In case there is no value present as a part of the constraint any offer with the corresponding attribute will be accepted (without value check).</li>
</ul>
</td>
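As a brief illustration of these matching rules (assuming the `spark.mesos.constraints` property and a `SparkConf` named `conf`; the attribute names are hypothetical):

{% highlight scala %}
// Accept only offers whose "os" attribute equals "centos7" and whose
// "zone" attribute equals "us-east-1a" (semicolon-separated attr:value pairs).
conf.set("spark.mesos.constraints", "os:centos7;zone:us-east-1a")
{% endhighlight %}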
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 3de72bc016..fd94c34d16 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -335,7 +335,7 @@ By default, standalone scheduling clusters are resilient to Worker failures (ins
**Overview**
-Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling _new_ applications -- applications that were already running during Master failover are unaffected.
+Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling _new_ applications -- applications that were already running during Master failover are unaffected.
Learn more about getting started with ZooKeeper [here](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html).
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d246100f3e..c4d277f9bf 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1372,7 +1372,7 @@ Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation ru
1. The reconciled schema contains exactly those fields defined in Hive metastore schema.
- Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
- - Any fileds that only appear in the Hive metastore schema are added as nullable field in the
+ - Any fields that only appear in the Hive metastore schema are added as nullable field in the
reconciled schema.
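As a hypothetical illustration of these rules: if the Hive metastore schema is `(key INT, value STRING)` and the Parquet files contain `(key INT, extra INT)`, the reconciled schema is `(key INT, value STRING)`; `extra` appears only in Parquet and is dropped, while `value` appears only in the metastore and is added back as a nullable field.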
#### Metadata Refreshing
diff --git a/docs/streaming-flume-integration.md b/docs/streaming-flume-integration.md
index e2d589b843..8eeeee75db 100644
--- a/docs/streaming-flume-integration.md
+++ b/docs/streaming-flume-integration.md
@@ -30,7 +30,7 @@ See the [Flume's documentation](https://flume.apache.org/documentation.html) for
configuring Flume agents.
#### Configuring Spark Streaming Application
-1. **Linking:** In your SBT/Maven projrect definition, link your streaming application against the following artifact (see [Linking section](streaming-programming-guide.html#linking) in the main programming guide for further information).
+1. **Linking:** In your SBT/Maven project definition, link your streaming application against the following artifact (see [Linking section](streaming-programming-guide.html#linking) in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}
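In SBT, for example, the linking step above corresponds to a dependency line roughly like the following (a sketch; the version is a placeholder for the Spark release you build against):

{% highlight scala %}
// build.sbt sketch: %% appends the Scala binary version to the artifact id.
val sparkVersion = "x.y.z" // placeholder: match your Spark version
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % sparkVersion
{% endhighlight %}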
diff --git a/docs/streaming-kinesis-integration.md b/docs/streaming-kinesis-integration.md
index 5f5e2b9087..2a868e8bca 100644
--- a/docs/streaming-kinesis-integration.md
+++ b/docs/streaming-kinesis-integration.md
@@ -95,7 +95,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
</div>
</div>
- - `streamingContext`: StreamingContext containg an application name used by Kinesis to tie this Kinesis application to the Kinesis stream
+ - `streamingContext`: StreamingContext containing an application name used by Kinesis to tie this Kinesis application to the Kinesis stream
- `[Kinesis app name]`: The application name that will be used to checkpoint the Kinesis
sequence numbers in DynamoDB table.
@@ -216,6 +216,6 @@ de-aggregate records during consumption.
- Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.
-- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPostitionInStream.LATEST). This is configurable.
+- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.
- InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
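For orientation, a minimal sketch of creating the stream described above, assuming the Scala `KinesisUtils.createStream` helper of this era; `ssc` is an existing StreamingContext, and the app name, stream name, endpoint, and region are placeholders:

{% highlight scala %}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

// Start from the oldest available record; checkpoint to DynamoDB every 10s.
val stream = KinesisUtils.createStream(
  ssc, "myKinesisApp", "myKinesisStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)
{% endhighlight %}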
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 4d1932bc8c..5d67a0a9a9 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -158,7 +158,7 @@ JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999
{% endhighlight %}
This `lines` DStream represents the stream of data that will be received from the data
-server. Each record in this stream is a line of text. Then, we want to split the the lines by
+server. Each record in this stream is a line of text. Then, we want to split the lines by
space into words.
{% highlight java %}