| author | Cheng Lian <lian@databricks.com> | 2015-06-23 14:19:21 -0700 |
|---|---|---|
| committer | Cheng Lian <lian@databricks.com> | 2015-06-23 14:19:21 -0700 |
| commit | d96d7b55746cf034e3935ec4b22614a99e48c498 (patch) | |
| tree | 730159c3fd40f02bffae497d89b942e0157b483b /docs/sql-programming-guide.md | |
| parent | a8031183aff2e23de9204ddfc7e7f5edbf052a7e (diff) | |
| download | spark-d96d7b55746cf034e3935ec4b22614a99e48c498.tar.gz spark-d96d7b55746cf034e3935ec4b22614a99e48c498.tar.bz2 spark-d96d7b55746cf034e3935ec4b22614a99e48c498.zip | |
[DOC] [SQL] Adds Hive metastore Parquet table conversion section
This PR adds a section about Hive metastore Parquet table conversion. It documents:
1. Schema reconciliation rules introduced in #5214 (see [this comment] [1] in #5188)
2. Metadata refreshing requirement introduced in #5339
[1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248
Author: Cheng Lian <lian@databricks.com>
Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:
42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Adds Hive metastore Parquet table conversion section
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r-- | docs/sql-programming-guide.md | 94 |
1 file changed, 88 insertions, 6 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 26c036f664..9107c9b676 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -22,7 +22,7 @@ The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.
 All of the examples on this page use sample data included in the Spark distribution and can be run
 in the `spark-shell`, `pyspark` shell, or `sparkR` shell.

-## Starting Point: `SQLContext`
+## Starting Point: SQLContext

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {

 </div>

+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.sql("REFRESH TABLE my_table")
+{% endhighlight %}
+
+</div>
+
 <div data-lang="sql" markdown="1">

 {% highlight sql %}
@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable

 </div>

-### Partition discovery
+### Partition Discovery

 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
@@ -1108,7 +1117,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
 `true`. When type inference is disabled, string type will be used for the partitioning columns.

-### Schema merging
+### Schema Merging

 Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
 a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
@@ -1208,6 +1217,79 @@ printSchema(df3)

 </div>

+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
+
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is significant
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting
+a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+   nullability. The reconciled field should have the data type of the Parquet side, so that
+   nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
+     reconciled schema.
+
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
+conversion is enabled, metadata of those converted tables is also cached. If these tables are
+updated by Hive or other external tools, you need to refresh them manually to ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
 ### Configuration

 Configuration of Parquet can be done using the `setConf` method on `SQLContext` or by running
@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes Hive. Note that this Hive a
 on all of the worker nodes, as they will need access to the Hive serialization and deserialization
 libraries (SerDes) in order to access data stored in Hive.

-Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
-the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running
+the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory
 and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the
 YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--file` option of the
 `spark-submit` command.
@@ -1889,7 +1971,7 @@ options.
 #### DataFrame data reader/writer interface

 Based on user feedback, we created a new, more fluid API for reading data in (`SQLContext.read`)
-and writing data out (`DataFrame.write`),
+and writing data out (`DataFrame.write`),
 and deprecated the old APIs (e.g. `SQLContext.parquetFile`, `SQLContext.jsonFile`).

 See the API docs for `SQLContext.read` (
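The `spark.sql.hive.convertMetastoreParquet` flag documented above can also be toggled at runtime through `SQLContext.setConf`. A minimal sketch, assuming an existing `HiveContext` named `sqlContext` as in the snippets above:

{% highlight scala %}
// sqlContext is an existing HiveContext

// Fall back to Hive SerDe when reading and writing metastore Parquet tables
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

// Restore the default: use Spark SQL's built-in Parquet support
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
{% endhighlight %}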
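The schema reconciliation rules in the new section can be illustrated with Spark SQL's `StructType` API. The two input schemas below are invented for illustration; only the shape of the reconciled result follows from the documented rules:

{% highlight scala %}
import org.apache.spark.sql.types._

// Hive metastore schema: names are case insensitive, every column is nullable
val hiveSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("hive_only", StringType, nullable = true)))

// Parquet schema: names are case sensitive, nullability is significant
val parquetSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("VALUE", StringType, nullable = true),
  StructField("parquet_only", LongType, nullable = true)))

// Reconciled schema, per the rules above:
// - exactly the fields of the Hive metastore schema ("parquet_only" is dropped)
// - matching fields ("key", "value"/"VALUE") take the Parquet side's data type
//   and nullability
// - "hive_only", absent from the Parquet schema, is added back as nullable
val reconciledSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true),
  StructField("hive_only", StringType, nullable = true)))
{% endhighlight %}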