author    Wenchen Fan <wenchen@databricks.com>    2017-01-05 17:40:27 -0800
committer Yin Huai <yhuai@databricks.com>         2017-01-05 17:40:27 -0800
commit    cca945b6aa679e61864c1cabae91e6ae7703362e (patch)
tree      f7b33ef60fc92237503fb911270b8bedba76815b /docs
parent    f5d18af6a8a0b9f8c2e9677f9d8ae1712eb701c6 (diff)
[SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
## What changes were proposed in this pull request?

Today we have different syntax to create data source or hive serde tables; we should unify them so as not to confuse users, and as a step toward making hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.

TODO (for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax; we should decide whether to add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. We should decide whether to change the behavior of `SET LOCATION`.

## How was this patch tested?

New tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16296 from cloud-fan/create-table.
Diffstat (limited to 'docs')
-rw-r--r--  docs/sql-programming-guide.md  | 60
1 file changed, 52 insertions(+), 8 deletions(-)
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 4cd21aef91..0f6e344655 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -522,14 +522,11 @@ Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the `table` method on a `SparkSession` with the name of the table.
-By default `saveAsTable` will create a "managed table", meaning that the location of the data will
-be controlled by the metastore. Managed tables will also have their data deleted automatically
-when a table is dropped.
-
-Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`.
-However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
-and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table
-is dropped only its metadata is removed.
+For file-based data sources, e.g. text, parquet, json, etc., you can specify a custom table path via the
+`path` option, e.g. `df.write.option("path", "/some/path").saveAsTable("t")`. When the table is dropped,
+the custom table path will not be removed and the table data will still be there. If no custom table path is
+specified, Spark will write data to a default table path under the warehouse directory. When the table is
+dropped, the default table path will be removed too.
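+
+As a minimal sketch of the above (assuming an existing DataFrame `df` and a `SparkSession` named `spark`;
+the table name `t` and the path are illustrative):
+
+{% highlight scala %}
+// Save df as a persistent table backed by a custom path. Dropping the table
+// later removes only its metadata; files under /some/path are left intact.
+df.write.option("path", "/some/path").saveAsTable("t")
+
+// Read the persistent table back through the metastore.
+val t = spark.table("t")
+{% endhighlight %}
+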
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
@@ -954,6 +951,53 @@ adds support for finding tables in the MetaStore and writing queries using HiveQL.
</div>
</div>
+### Specifying storage format for Hive tables
+
+When you create a Hive table, you need to define how this table should read/write data from/to the file
+system, i.e. the "input format" and "output format". You also need to define how this table should
+deserialize the data to rows, or serialize rows to data, i.e. the "serde". The following options can be used
+to specify the storage format ("serde", "input format", "output format"), e.g.
+`CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')`. By default, we will read the table files
+as plain text. Note that the Hive storage handler is not supported yet when creating a table; you can create
+a table using a storage handler on the Hive side, and use Spark SQL to read it.
+
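+As a minimal, runnable sketch of the example above (assuming a `SparkSession` named `spark`):
+
+{% highlight scala %}
+// Create a Hive serde table stored as Parquet; the serde, input format and
+// output format are all derived from the 'parquet' fileFormat.
+spark.sql("CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')")
+
+spark.sql("INSERT INTO src VALUES (1), (2)")
+spark.sql("SELECT * FROM src").show()
+{% endhighlight %}
+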
+<table class="table">
+ <tr><th>Property Name</th><th>Meaning</th></tr>
+ <tr>
+ <td><code>fileFormat</code></td>
+ <td>
+ A fileFormat is a kind of package of storage format specifications, including "serde", "input format" and
+ "output format". Currently we support 6 fileFormats: 'sequencefile', 'rcfile', 'orc', 'parquet', 'textfile' and 'avro'.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>inputFormat, outputFormat</code></td>
+ <td>
+ These 2 options specify the name of a corresponding <code>InputFormat</code> and <code>OutputFormat</code> class as a string literal,
+ e.g. <code>org.apache.hadoop.hive.ql.io.orc.OrcInputFormat</code>. These 2 options must appear as a pair, and you cannot
+ specify them if you have already specified the <code>fileFormat</code> option.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>serde</code></td>
+ <td>
+ This option specifies the name of a serde class. When the <code>fileFormat</code> option is specified, do not specify this option
+ if the given <code>fileFormat</code> already includes the serde information. Currently "sequencefile", "textfile" and "rcfile"
+ don't include the serde information, so you can use this option with these 3 fileFormats.
+ </td>
+ </tr>
+
+ <tr>
+ <td><code>fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim</code></td>
+ <td>
+ These options can only be used with the "textfile" fileFormat. They define how to read delimited files into rows.
+ </td>
+ </tr>
+</table>
+
+All other properties defined with `OPTIONS` will be regarded as Hive serde properties.
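+
+As a minimal sketch combining the options above (the table name and delimiter are illustrative):
+
+{% highlight scala %}
+// 'textfile' carries no serde information, so fieldDelim can be set alongside it.
+spark.sql(
+  "CREATE TABLE logs(id int, msg string) USING hive " +
+  "OPTIONS(fileFormat 'textfile', fieldDelim ',')")
+{% endhighlight %}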
+
### Interacting with Different Versions of Hive Metastore
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,