path: root/docs/sql-programming-guide.md
author    Cheng Lian <lian@databricks.com>    2015-06-23 14:19:21 -0700
committer Cheng Lian <lian@databricks.com>    2015-06-23 14:19:21 -0700
commit    d96d7b55746cf034e3935ec4b22614a99e48c498 (patch)
tree      730159c3fd40f02bffae497d89b942e0157b483b /docs/sql-programming-guide.md
parent    a8031183aff2e23de9204ddfc7e7f5edbf052a7e (diff)
[DOC] [SQL] Adds Hive metastore Parquet table conversion section
This PR adds a section about Hive metastore Parquet table conversion. It documents:

1. Schema reconciliation rules introduced in #5214 (see [this comment][1] in #5188)
2. Metadata refreshing requirement introduced in #5339

[1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248

Author: Cheng Lian <lian@databricks.com>

Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Adds Hive metastore Parquet table conversion section
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md  94
1 file changed, 88 insertions, 6 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 26c036f664..9107c9b676 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
-## Starting Point: `SQLContext`
+## Starting Point: SQLContext
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {
</div>
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.sql("REFRESH TABLE my_table")
+{% endhighlight %}
+
+</div>
+
<div data-lang="sql" markdown="1">
{% highlight sql %}
@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable
</div>
-### Partition discovery
+### Partition Discovery
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
@@ -1108,7 +1117,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
`true`. When type inference is disabled, string type will be used for the partitioning columns.
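
For instance, a minimal sketch of disabling type inference so that partitioning columns are read
as strings (assuming an existing `SQLContext` named `sqlContext`):

{% highlight scala %}
// Treat all partitioning column values as strings instead of inferring their types
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
{% endhighlight %}
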
-### Schema merging
+### Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
@@ -1208,6 +1217,79 @@ printSchema(df3)
</div>
+### Hive Metastore Parquet Table Conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
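+
+For example, a minimal sketch of turning the conversion off and falling back to the Hive SerDe
+(assuming an existing `HiveContext` named `sqlContext`):
+
+{% highlight scala %}
+// Use the Hive SerDe instead of Spark SQL's native Parquet support
+sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
+{% endhighlight %}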
+
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is significant
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
+Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules, illustrated by the
+sketch after this list, are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+ nullability. The reconciled field should have the data type of the Parquet side, so that
+ nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+ - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
+     reconciled schema.
+
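+As a hypothetical illustration of these rules (the schemas below are invented for this example):
+
+{% highlight scala %}
+import org.apache.spark.sql.types._
+
+// Hypothetical Hive metastore schema: lower-cased field names, all columns nullable
+val metastoreSchema = StructType(Seq(
+  StructField("id", LongType, nullable = true),
+  StructField("name", StringType, nullable = true),
+  StructField("extra", StringType, nullable = true)))
+
+// Hypothetical Parquet file schema: mixed-case field name, explicit nullability, no `extra` field
+val parquetSchema = StructType(Seq(
+  StructField("ID", LongType, nullable = false),
+  StructField("name", StringType, nullable = true)))
+
+// Expected reconciled schema under the rules above:
+//   id    LongType,   nullable = false  (data type and nullability taken from the Parquet side)
+//   name  StringType, nullable = true   (matched case-insensitively)
+//   extra StringType, nullable = true   (metastore-only field, added as a nullable field)
+// Any field appearing only in the Parquet schema would be dropped.
+{% endhighlight %}
+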
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
### Configuration
Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes Hive. Note that this Hive a
on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
(SerDes) in order to access data stored in Hive.
-Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
-the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
+the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
and `hive-site.xml` under the `conf/` directory need to be available on the driver and all executors launched by the
YARN cluster. A convenient way to do this is to add them through the `--jars` and `--files` options of the
`spark-submit` command.
@@ -1889,7 +1971,7 @@ options.
#### DataFrame data reader/writer interface
Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
-and writing data out (`DataFrame.write`),
+and writing data out (`DataFrame.write`),
and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).
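
As a rough sketch of the new interface (using the sample `people.json` file shipped with the Spark
distribution; the output path is arbitrary):

{% highlight scala %}
// sqlContext is an existing SQLContext
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.write.format("parquet").save("people.parquet")
{% endhighlight %}
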
See the API docs for `SQLContext.read` (