author    Michael Armbrust <michael@databricks.com>  2015-08-27 11:45:15 -0700
committer Michael Armbrust <michael@databricks.com>  2015-08-27 11:45:28 -0700
commit    965b3bb0ee4171f2c2533c0623f2cd680d700a2b (patch)
tree      7aa8c52daf26e3400997459f3d6357c39edfb143
parent    30f0f7e4e39b58091e0a10199b6da81d14fa7fdb (diff)
download  spark-965b3bb0ee4171f2c2533c0623f2cd680d700a2b.tar.gz
          spark-965b3bb0ee4171f2c2533c0623f2cd680d700a2b.tar.bz2
          spark-965b3bb0ee4171f2c2533c0623f2cd680d700a2b.zip
[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide
Author: Michael Armbrust <michael@databricks.com>

Closes #8441 from marmbrus/documentation.

(cherry picked from commit dc86a227e4fc8a9d8c3e8c68da8dff9298447fd0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
-rw-r--r--  docs/sql-programming-guide.md | 92
1 file changed, 73 insertions, 19 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e64190b9b2..99fec6c778 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
-For how to enable Hive support, please refer to the [Hive Tables](#hive-tables) section.
+Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
# DataFrames
@@ -213,6 +213,11 @@ df.groupBy("age").count().show()
// 30 1
{% endhighlight %}
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
+
+
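As a rough sketch of what these functions look like in use, assuming the `df` with `name` and `age` columns from the examples above:

{% highlight scala %}
import org.apache.spark.sql.functions._

// String manipulation and common math operations from
// org.apache.spark.sql.functions, combined with simple column expressions.
df.select(
  upper(col("name")).as("upperName"),       // string manipulation
  (col("age") + 1).as("ageNextYear"),       // column arithmetic
  round(sqrt(col("age")), 2).as("sqrtAge")  // common math operations
).show()
{% endhighlight %}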
</div>
<div data-lang="java" markdown="1">
@@ -263,6 +268,10 @@ df.groupBy("age").count().show();
// 30 1
{% endhighlight %}
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
+
</div>
<div data-lang="python" markdown="1">
@@ -320,6 +329,10 @@ df.groupBy("age").count().show()
{% endhighlight %}
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
+
</div>
<div data-lang="r" markdown="1">
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age")))
{% endhighlight %}
-</div>
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).
</div>
+</div>
## Running SQL Queries Programmatically
@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet")
Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
-atomic. Thus, it is not safe to have multiple writers attempting to write to the same location.
-Additionally, when performing a `Overwrite`, the data will be deleted before writing out the
+atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
new data.
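A rough sketch of passing a save mode through the `DataFrame` writer API, reusing the `df` and the `namesAndAges.parquet` path from the examples above (the modes themselves are listed in the table below):

{% highlight scala %}
import org.apache.spark.sql.SaveMode

// Append to any existing data at the target path instead of failing;
// the default behavior is SaveMode.ErrorIfExists ("error").
df.write.mode(SaveMode.Append).parquet("namesAndAges.parquet")

// Equivalent form using the string name of the mode.
df.write.format("parquet").mode("append").save("namesAndAges.parquet")
{% endhighlight %}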
<table class="table">
-<tr><th>Scala/Java</th><th>Python</th><th>Meaning</th></tr>
+<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
<tr>
<td><code>SaveMode.ErrorIfExists</code> (default)</td>
<td><code>"error"</code> (default)</td>
@@ -1671,12 +1686,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
### Interacting with Different Versions of Hive Metastore
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
+which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
+build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
+Note that, independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
+will compile against Hive 1.2.1 and use those classes for internal execution (SerDes, UDFs, UDAFs, etc.).
-Internally, Spark SQL uses two Hive clients, one for executing native Hive commands like `SET`
-and `DESCRIBE`, the other dedicated for communicating with Hive metastore. The former uses Hive
-jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses Hive jars of the
-version specified by users. An isolated classloader is used here to avoid dependency conflicts.
+The following options can be used to configure the version of Hive that is used to retrieve metadata:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
@@ -1685,7 +1700,7 @@ version specified by users. An isolated classloader is used here to avoid depend
<td><code>0.13.1</code></td>
<td>
Version of the Hive metastore. Available
- options are <code>0.12.0</code> and <code>0.13.1</code>. Support for more versions is coming in the future.
+ options are <code>0.12.0</code> through <code>1.2.1</code>.
</td>
</tr>
<tr>
@@ -1696,12 +1711,16 @@ version specified by users. An isolated classloader is used here to avoid depend
property can be one of three options:
<ol>
<li><code>builtin</code></li>
- Use Hive 0.13.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
+ Use Hive 1.2.1, which is bundled with the Spark assembly jar when <code>-Phive</code> is
enabled. When this option is chosen, <code>spark.sql.hive.metastore.version</code> must be
- either <code>0.13.1</code> or not defined.
+ either <code>1.2.1</code> or not defined.
<li><code>maven</code></li>
- Use Hive jars of specified version downloaded from Maven repositories.
- <li>A classpath in the standard format for both Hive and Hadoop.</li>
+ Use Hive jars of the specified version downloaded from Maven repositories. This configuration
+ is not generally recommended for production deployments.
+ <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
+ and its dependencies, including the correct version of Hadoop. These jars only need to be
+ present on the driver, but if you are running in YARN cluster mode then you must ensure
+ they are packaged with your application.</li>
</ol>
</td>
</tr>
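A rough sketch of setting the two metastore properties above when constructing a `HiveContext`; the application name and jar paths below are placeholders, not values taken from this guide:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// These properties must be in place before the HiveContext is created.
// The jars entry is a JVM-style classpath pointing at a real Hive/Hadoop
// installation available on the driver.
val conf = new SparkConf()
  .setAppName("metastore-example")
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars",
    "/path/to/hive-0.13.1/lib/*:/path/to/hadoop/client/lib/*")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
{% endhighlight %}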
@@ -2017,6 +2036,28 @@ options.
# Migration Guide
+## Upgrading From Spark SQL 1.4 to 1.5
+
+ - Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
+ code generation for expression evaluation. These features can both be disabled by setting
+ `spark.sql.tungsten.enabled` to `false` (see the configuration sketch after this list).
+ - Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
+ `spark.sql.parquet.mergeSchema` to `true`.
+ - Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
+ access nested values. For example, `df['table.column.nestedField']`. However, this means that if
+ your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
+ - In-memory columnar storage partition pruning is on by default. It can be disabled by setting
+ `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
+ - Unlimited precision decimal columns are no longer supported; instead, Spark SQL enforces a maximum
+ precision of 38. When inferring schema from `BigDecimal` objects, a precision of (38, 18) is now
+ used. When no precision is specified in DDL, the default remains `Decimal(10, 0)`.
+ - Timestamps are now stored at a precision of 1us, rather than 1ns.
+ - In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
+ unchanged.
+ - The canonical names of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
+ - It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe;
+ therefore, this output committer will not be used when speculation is on, regardless of configuration.
+
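A rough configuration sketch for the flags above, assuming an existing `sqlContext` (for example in `spark-shell`); each line restores the 1.4-era behavior described in the corresponding list item:

{% highlight scala %}
// Disable Tungsten execution and expression code generation.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
// Re-enable Parquet schema merging.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
// Turn off in-memory columnar partition pruning.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")
{% endhighlight %}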
## Upgrading from Spark SQL 1.3 to 1.4
#### DataFrame data reader/writer interface
@@ -2038,7 +2079,8 @@ See the API docs for `SQLContext.read` (
#### DataFrame.groupBy retains grouping columns
-Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
+Based on user feedback, we changed the default behavior of `DataFrame.groupBy().agg()` to retain the
+grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to `false`.
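A rough sketch of the new default and of the compatibility flag, assuming an existing `sqlContext` and a `df` with hypothetical `department`, `age` and `expense` columns:

{% highlight scala %}
import org.apache.spark.sql.functions._

// In 1.4 the grouping column "department" is included in the result by default.
df.groupBy("department").agg(max("age"), sum("expense")).show()

// Revert to the 1.3 behavior, where grouping columns are dropped from the result.
sqlContext.setConf("spark.sql.retainGroupColumns", "false")
df.groupBy("department").agg(max("age"), sum("expense")).show()
{% endhighlight %}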
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -2175,7 +2217,7 @@ Python UDF registration is unchanged.
When using DataTypes in Python you will need to construct them (e.g., `StringType()`) instead of
referencing a singleton.
-## Migration Guide for Shark User
+## Migration Guide for Shark Users
### Scheduling
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
@@ -2251,6 +2293,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDes)
+* Window functions
* Joins
* `JOIN`
* `{LEFT|RIGHT|FULL} OUTER JOIN`
@@ -2261,7 +2304,7 @@ Spark SQL supports the vast majority of Hive features, such as:
* `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
-* Partitioned tables
+* Partitioned tables, including dynamic partition insertion
* View
* All Hive DDL Functions, including:
* `CREATE TABLE`
@@ -2323,8 +2366,9 @@ releases of Spark SQL.
Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
metadata. Spark SQL does not support that.
+# Reference
-# Data Types
+## Data Types
Spark SQL and DataFrames support the following data types:
@@ -2937,3 +2981,13 @@ from pyspark.sql.types import *
</div>
+## NaN Semantics
+
+There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
+does not exactly match standard floating point semantics; a short sketch after the list illustrates
+the behavior. Specifically:
+
+ - NaN = NaN returns true.
+ - In aggregations all NaN values are grouped together.
+ - NaN is treated as a normal value in join keys.
+ - NaN values sort last in ascending order, as they are treated as larger than any other numeric value.
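A rough sketch of the grouping and ordering rules above, assuming a `spark-shell` session where `sqlContext` is available:

{% highlight scala %}
import sqlContext.implicits._

val df = Seq(("a", 1.0), ("b", Double.NaN), ("c", 2.0), ("d", Double.NaN))
  .toDF("key", "value")

// Both NaN rows fall into a single group.
df.groupBy("value").count().show()

// NaN sorts after every other numeric value in ascending order.
df.orderBy("value").show()
{% endhighlight %}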