author Yanbo Liang <ybliang8@gmail.com> 2015-12-07 23:50:57 -0800
committer Xiangrui Meng <meng@databricks.com> 2015-12-07 23:50:57 -0800
commit 4a39b5a1bee28cec792d509654f6236390cafdcb
tree 1637657b13ee5294d74abf8f3f2f4c3f5bf9ba86 /docs/ml-features.md
parent 7d05a624510f7299b3dd07f87c203db1ff7caa3e
[SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code
Add `SQLTransformer` user guide and example code, and make the Scala API doc clearer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- docs/ml-features.md 59
1 files changed, 59 insertions, 0 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 5105a948fe..f85e0d56d2 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -756,6 +756,65 @@ for more details on the API.
</div>
</div>
+## SQLTransformer
+
+`SQLTransformer` implements transformations that are defined by a SQL statement.
+Currently, only SQL syntax like `"SELECT ... FROM __THIS__ ..."` is supported,
+where `"__THIS__"` represents the underlying table of the input dataset.
+The select clause specifies the fields, constants, and expressions to display in
+the output, and can be any select clause that Spark SQL supports. Users can also
+use Spark SQL built-in functions and UDFs to operate on these selected columns.
+For example, `SQLTransformer` supports statements like:
+
+* `SELECT a, a + b AS a_b FROM __THIS__`
+* `SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ WHERE a > 5`
+* `SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b`
+
+**Examples**
+
+Assume that we have the following DataFrame with columns `id`, `v1` and `v2`:
+
+~~~~
+ id | v1 | v2
+----|-----|-----
+ 0 | 1.0 | 3.0
+ 2 | 2.0 | 5.0
+~~~~
+
+This is the output of the `SQLTransformer` with the statement `"SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__"`:
+
+~~~~
+ id | v1 | v2 | v3 | v4
+----|-----|-----|-----|-----
+ 0 | 1.0 | 3.0 | 4.0 | 3.0
+ 2 | 2.0 | 5.0 | 7.0 |10.0
+~~~~
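To make the role of `__THIS__` concrete, the following is a minimal, Spark-free sketch of the idea: the input rows are registered as a table, the placeholder is replaced with that table's name, and the statement is run. It uses Python's standard `sqlite3` module as a stand-in SQL engine. The helper `sql_transform` and the table name `input_table` are hypothetical names chosen for illustration, and this is not a description of how Spark implements `SQLTransformer`.

```python
import sqlite3


def sql_transform(columns, rows, statement, table_name="input_table"):
    """Emulate the __THIS__ placeholder with an in-memory SQLite table.

    Illustration only: SQLite is not Spark SQL, and this helper is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Register the input rows as a table (column names are trusted here).
        conn.execute(f"CREATE TABLE {table_name} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {table_name} VALUES ({placeholders})", rows)
        # Point the statement at the registered table, then run it.
        cursor = conn.execute(statement.replace("__THIS__", table_name))
        out_columns = [d[0] for d in cursor.description]
        return out_columns, cursor.fetchall()
    finally:
        conn.close()


columns = ["id", "v1", "v2"]
rows = [(0, 1.0, 3.0), (2, 2.0, 5.0)]
statement = "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__"

out_columns, out_rows = sql_transform(columns, rows, statement)
print(out_columns)
for row in out_rows:
    print(row)
```

With the two input rows above, the result carries the original `id`, `v1` and `v2` columns plus the derived `v3` and `v4`, matching the output table shown earlier.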
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [SQLTransformer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.SQLTransformer)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/SQLTransformerExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [SQLTransformer Java docs](api/java/org/apache/spark/ml/feature/SQLTransformer.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaSQLTransformerExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+Refer to the [SQLTransformer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.SQLTransformer) for more details on the API.
+
+{% include_example python/ml/sql_transformer.py %}
+</div>
+</div>
+
## VectorAssembler
`VectorAssembler` is a transformer that combines a given list of columns into a single vector