author     sethah <seth.hendrickson16@gmail.com>    2016-05-19 23:29:37 -0700
committer  Xiangrui Meng <meng@databricks.com>      2016-05-19 23:29:37 -0700
commit     5e203505f1a092e5849ebd01d9ff9e4fc6cdc34a
tree       8b59b210bfaadfa89d922fe98ea93c0687c8da07
parent     47a2940da97caa55bbb8bb8ec1d51c9f6d5041c6
[SPARK-15394][ML][DOCS] User guide typos and grammar audit
## What changes were proposed in this pull request?

Correct some typos and incorrectly worded sentences.

## How was this patch tested?

Doc changes only.

Note that many of these changes were identified by whomfire01

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13180 from sethah/ml_guide_audit.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r--  docs/ml-features.md  47
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c44ace91f2..3db24a3840 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -26,7 +26,7 @@ to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and th
Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
document frequency `$DF(t, D)$` is the number of documents that contains term `$t$`. If we only use
term frequency to measure the importance, it is very easy to over-emphasize terms that appear very
-often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
+often but carry little information about the document, e.g. "a", "the", and "of". If a term appears
very often across the corpus, it means it doesn't carry special information about a particular document.
Inverse document frequency is a numerical measure of how much information a term provides:
`\[
@@ -50,7 +50,7 @@ A raw feature is mapped into an index (term) by applying a hash function. Then t
are calculated based on the mapped indices. This approach avoids the need to compute a global
term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
collisions, where different raw features may become the same term after hashing. To reduce the
-chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+chance of collision, we can increase the target feature dimension, i.e. the number of buckets
of the hash table. Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the feature dimension, otherwise the features will
not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
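
As a rough illustration of the TF-IDF flow described above (an editorial sketch, not part of this patch), assuming a `SparkSession` named `spark` and made-up sentences:

~~~scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Hypothetical toy corpus; any DataFrame with a string column works.
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "a the of spark"),
  (1.0, "spark hashes terms into buckets")
)).toDF("label", "sentence")

val wordsData = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .transform(sentenceData)

// A power of two as the feature dimension, per the advice above.
val featurized = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 10)
  .transform(wordsData)

// IDF down-weights terms that appear across many documents.
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  .fit(featurized)
idfModel.transform(featurized).select("features").show(false)
~~~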
@@ -104,7 +104,7 @@ the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for mor
`Word2Vec` is an `Estimator` which takes sequences of words representing documents and trains a
`Word2VecModel`. The model maps each word to a unique fixed-size vector. The `Word2VecModel`
transforms each document into a vector using the average of all words in the document; this vector
-can then be used for as features for prediction, document similarity calculations, etc.
+can then be used as features for prediction, document similarity calculations, etc.
Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#word2vec) for more
details.
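
A minimal sketch of the averaging behavior described above (not part of this patch), again assuming a `SparkSession` named `spark` and hypothetical documents:

~~~scala
import org.apache.spark.ml.feature.Word2Vec

// Hypothetical documents, each pre-split into words.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)   // size of each word vector
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

// Each document becomes the average of its word vectors.
model.transform(documentDF).show(false)
~~~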
@@ -140,12 +140,12 @@ for more details on the API.
`CountVectorizer` and `CountVectorizerModel` aim to help convert a collection of text documents
to vectors of token counts. When an a-priori dictionary is not available, `CountVectorizer` can
- be used as an `Estimator` to extract the vocabulary and generates a `CountVectorizerModel`. The
+ be used as an `Estimator` to extract the vocabulary, and generates a `CountVectorizerModel`. The
model produces sparse representations for the documents over the vocabulary, which can then be
passed to other algorithms like LDA.
During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
- term frequency across the corpus. An optional parameter "minDF" also affect the fitting process
+ term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
included in the vocabulary.
@@ -161,8 +161,8 @@ Assume that we have the following DataFrame with columns `id` and `texts`:
~~~~
each row in `texts` is a document of type Array[String].
-Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c),
-then the output column "vector" after transformation contains:
+Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with vocabulary (a, b, c).
+Then the output column "vector" after transformation contains:
~~~~
id | texts | vector
@@ -171,7 +171,7 @@ then the output column "vector" after transformation contains:
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
~~~~
-each vector represents the token counts of the document over the vocabulary.
+Each vector represents the token counts of the document over the vocabulary.
<div class="codetabs">
<div data-lang="scala" markdown="1">
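
A sketch reproducing the tables above (editorial, not part of this patch), assuming a `SparkSession` named `spark`:

~~~scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

// The same toy data as the tables above.
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "texts")

// fit() extracts the vocabulary (a, b, c); transform() emits sparse counts.
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)
  .setMinDF(2)   // keep only terms appearing in at least 2 documents
  .fit(df)

cvModel.transform(df).show(false)
~~~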
@@ -477,8 +477,7 @@ for more details on the API.
## StringIndexer
`StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies.
-So the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
If the input column is numeric, we cast it to string and index the string
values. When downstream pipeline components such as `Estimator` or
`Transformer` make use of this string-indexed label, you must set the input
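
A minimal sketch of the frequency-ordered indexing described above (not part of this patch), assuming a `SparkSession` named `spark` and toy data:

~~~scala
import org.apache.spark.ml.feature.StringIndexer

// "a" occurs most often, so it receives index 0.0.
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)
~~~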
@@ -585,7 +584,7 @@ for more details on the API.
## IndexToString
Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
-back to a column containing the original labels as strings. The common use case
+back to a column containing the original labels as strings. A common use case
is to produce indices from labels with `StringIndexer`, train a model with those
indices and retrieve the original labels from the column of predicted indices
with `IndexToString`. However, you are free to supply your own labels.
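
Continuing the hypothetical `StringIndexer` sketch above (editorial, not part of this patch), the round trip back to strings looks like:

~~~scala
import org.apache.spark.ml.feature.IndexToString

// Reverses the StringIndexer sketch above, using the metadata
// that StringIndexer attached to `categoryIndex`.
val converted = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .transform(indexed)
~~~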
@@ -652,7 +651,7 @@ for more details on the API.
## OneHotEncoder
-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
<div class="codetabs">
<div data-lang="scala" markdown="1">
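
Building once more on the hypothetical `indexed` DataFrame from the `StringIndexer` sketch (not part of this patch):

~~~scala
import org.apache.spark.ml.feature.OneHotEncoder

// Turns each label index from the StringIndexer sketch into a
// sparse binary vector with at most one 1.0 entry.
val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)
~~~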
@@ -888,7 +887,7 @@ for more details on the API.
* `splits`: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; Otherwise, values outside the splits specified will be treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
-Note that if you have no idea of the upper bound and lower bound of the targeted column, you would better add the `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
+Note that if you have no idea of the upper and lower bounds of the targeted column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
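
A sketch of the splits behavior just described (editorial, not part of this patch), assuming a `SparkSession` named `spark` and made-up values:

~~~scala
import org.apache.spark.ml.feature.Bucketizer

// Infinite outer bounds guard against out-of-range values, as noted above.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// 5 splits define 4 buckets; each value maps to its bucket's index.
val bucketed = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)
  .transform(dataFrame)
~~~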
@@ -976,7 +975,7 @@ for more details on the API.
Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
where `"__THIS__"` represents the underlying table of the input dataset.
The select clause specifies the fields, constants, and expressions to display in
-the output, it can be any select clause that Spark SQL supports. Users can also
+the output, and can be any select clause that Spark SQL supports. Users can also
use Spark SQL built-in function and UDFs to operate on these selected columns.
For example, `SQLTransformer` supports statements like:
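
One such statement, sketched with hypothetical columns `v1` and `v2` (editorial, not part of this patch):

~~~scala
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))
).toDF("id", "v1", "v2")

// __THIS__ stands in for the input dataset.
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
~~~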
@@ -1121,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`:
~~~
`hour` is a continuous feature with `Double` type. We want to turn the continuous feature into
-categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
+a categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
~~~
id | hour | result
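
A minimal sketch of this discretization (editorial, not part of this patch), assuming a `SparkSession` named `spark` and hypothetical `hour` values:

~~~scala
import org.apache.spark.ml.feature.QuantileDiscretizer

// Hypothetical `hour` values to bucket into 3 quantile-based categories.
val df = spark.createDataFrame(
  Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
).toDF("id", "hour")

val result = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
  .fit(df)        // computes the quantile-based splits
  .transform(df)
~~~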
@@ -1171,19 +1170,19 @@ for more details on the API.
`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
sub-array of the original features. It is useful for extracting features from a vector column.
-`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column
+`VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
whose values are selected via those indices. There are two types of indices,
- 1. Integer indices that represents the indices into the vector, `setIndices()`;
+ 1. Integer indices that represent the indices into the vector, `setIndices()`.
- 2. String indices that represents the names of features into the vector, `setNames()`.
+ 2. String indices that represent the names of features into the vector, `setNames()`.
*This requires the vector column to have an `AttributeGroup` since the implementation matches on
the name field of an `Attribute`.*
Specification by integer and string are both acceptable. Moreover, you can use integer index and
string name simultaneously. At least one feature must be selected. Duplicate features are not
allowed, so there can be no overlap between selected indices and names. Note that if names of
-features are selected, an exception will be threw out when encountering with empty input attributes.
+features are selected, an exception will be thrown if empty input attributes are encountered.
The output vector will order features with the selected indices first (in the order given),
followed by the selected names (in the order given).
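
A sketch of integer-index slicing (editorial, not part of this patch), assuming a `SparkSession` named `spark` and a Spark 2.x build where `org.apache.spark.ml.linalg` is available:

~~~scala
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(
  Seq(Tuple1(Vectors.dense(-2.0, 2.3, 0.0)))
).toDF("userFeatures")

// Select by position; setNames() could be used instead (or as well)
// if the column carried named attributes.
val output = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  .setIndices(Array(1, 2))
  .transform(df)
~~~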
@@ -1198,8 +1197,8 @@ Suppose that we have a DataFrame with the column `userFeatures`:
[0.0, 10.0, 0.5]
~~~
-`userFeatures` is a vector column that contains three user features. Assuming that the first column
-of `userFeatures` are all zeros, so we want to remove it and only the last two columns are selected.
+`userFeatures` is a vector column that contains three user features. Assume that the first column
+of `userFeatures` are all zeros, so we want to remove it and select only the last two columns.
The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector
column named `features`:
@@ -1209,7 +1208,7 @@ column named `features`:
[0.0, 10.0, 0.5] | [10.0, 0.5]
~~~
-Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
+Suppose also that we have potential input attributes for the `userFeatures`, i.e.
`["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.
~~~
@@ -1337,8 +1336,8 @@ id | features | clicked
9 | [1.0, 0.0, 15.0, 0.1] | 0.0
~~~
-If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our label `clicked` the
-last column in our `features` chosen as the most useful feature:
+If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our label `clicked` the
+last column in our `features` is chosen as the most useful feature:
~~~
id | features | clicked | selectedFeatures
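
A sketch of the selection above (editorial, not part of this patch), assuming a `SparkSession` named `spark` and a Spark 2.x build where `org.apache.spark.ml.linalg` is available:

~~~scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// The same toy rows as the table above.
val df = spark.createDataFrame(Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)).toDF("id", "features", "clicked")

// A chi-squared test against `clicked` keeps the single most
// predictive column.
val result = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")
  .fit(df)
  .transform(df)
~~~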