-rw-r--r-- docs/sql-programming-guide.md 15
1 files changed, 11 insertions, 4 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 6287e2be95..4cd21aef91 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -515,7 +515,7 @@ new data.
### Saving to Persistent Tables
`DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
-command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
+command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
`saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
@@ -526,11 +526,18 @@ By default `saveAsTable` will create a "managed table", meaning that the locatio
be controlled by the metastore. Managed tables will also have their data deleted automatically
when a table is dropped.
-Currently, `saveAsTable` does not expose an API supporting the creation of an "External table" from a `DataFrame`,
-however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
-and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table
+Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`.
+However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
+and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table
is dropped only its metadata is removed.
+Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
+
+- Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
+- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.
+
+Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
+
## Parquet Files
[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
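
As a rough illustration of the "Saving to Persistent Tables" text changed in the hunk above, the following PySpark-style sketch passes a `path` option to the `DataFrameWriter` before calling `saveAsTable`, and then issues `MSCK REPAIR TABLE` to sync partition information. The helper names `save_as_external_table` and `sync_partitions` and their parameters are hypothetical and not part of Spark. The sketch assumes `df` is a Spark `DataFrame` and `spark` is a `SparkSession`. Whether the repair step is actually needed depends on how the table was created.

```python
def save_as_external_table(df, table_name, path, partition_cols=()):
    """Save `df` as a persistent table whose data lives at `path`.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame.
    Supplying the `path` option is what the guide describes as the way
    to get an external table out of `saveAsTable`.
    """
    writer = df.write
    if partition_cols:
        # Partition the output by the given columns, if any were requested.
        writer = writer.partitionBy(*partition_cols)
    writer.option("path", path).saveAsTable(table_name)


def sync_partitions(spark, table_name):
    """Ask the metastore to pick up the table's partition information.

    Hypothetical helper: `spark` is assumed to be a SparkSession and
    `table_name` a trusted identifier (it is interpolated into SQL as-is).
    Returns whatever `spark.sql` returns.
    """
    return spark.sql("MSCK REPAIR TABLE {}".format(table_name))
```

With an active session, a call might look like `save_as_external_table(df, "events", "/tmp/events", ["date"])` followed by `sync_partitions(spark, "events")`. The table name, location, and partition column here are placeholders.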