author | Cheng Lian <lian@databricks.com> | 2015-06-23 17:24:26 -0700 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2015-06-23 17:24:26 -0700 |
commit | 111d6b9b8a584b962b6ae80c7aa8c45845ce0099 | |
tree | bc5955310ec43cb175ea77a147fc3bd99340e27b /docs/sql-programming-guide.md | |
parent | 7fb5ae5024284593204779ff463bfbdb4d1c6da5 | |
[SPARK-8139] [SQL] Updates docs and comments of data sources and Parquet output committer options
This PR applies only to the master branch (1.5.0-SNAPSHOT), since it references `org.apache.parquet` classes, which only appear starting with Parquet 1.7.0.
Author: Cheng Lian <lian@databricks.com>
Closes #6683 from liancheng/output-committer-docs and squashes the following commits:
b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r-- | docs/sql-programming-guide.md | 30 |
1 files changed, 29 insertions, 1 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 9107c9b676..2786e3d2cd 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1348,6 +1348,34 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
 support.
 </td>
 </tr>
+<tr>
+ <td><code>spark.sql.parquet.output.committer.class</code></td>
+ <td><code>org.apache.parquet.hadoop.<br />ParquetOutputCommitter</code></td>
+ <td>
+ <p>
+ The output committer class used by Parquet. The specified class needs to be a subclass of
+ <code>org.apache.hadoop.<br />mapreduce.OutputCommitter</code>. Typically, it's also a
+ subclass of <code>org.apache.parquet.hadoop.ParquetOutputCommitter</code>.
+ </p>
+ <p>
+ <b>Note:</b>
+ <ul>
+ <li>
+ This option must be set via Hadoop <code>Configuration</code> rather than Spark
+ <code>SQLConf</code>.
+ </li>
+ <li>
+ This option overrides <code>spark.sql.sources.<br />outputCommitterClass</code>.
+ </li>
+ </ul>
+ </p>
+ <p>
+ Spark SQL comes with a builtin
+ <code>org.apache.spark.sql.<br />parquet.DirectParquetOutputCommitter</code>, which can be more
+ efficient then the default Parquet output committer when writing data to S3.
+ </p>
+ </td>
+</tr>
 </table>

 ## JSON Datasets
@@ -1876,7 +1904,7 @@ that these options will be deprecated in future release as more optimizations ar
 Configures the number of partitions to use when shuffling data for joins or aggregations.
 </td>
 </tr>
- <tr>
+ <tr>
 <td><code>spark.sql.planner.externalSort</code></td>
 <td>false</td>
 <td>
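
The added documentation says that `spark.sql.parquet.output.committer.class` must be set on the Hadoop `Configuration` rather than through `SQLConf`. A minimal sketch of what that could look like in a Spark 1.5-era application follows. The object name, application name, and the `local[*]` fallback master are assumptions for illustration and are not part of this commit; only the option key and the committer class name come from the diff above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object, used only to illustrate where the option is set.
object CommitterConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("committer-config-sketch")     // assumed name
      .setIfMissing("spark.master", "local[*]")  // assumed fallback for local runs
    val sc = new SparkContext(conf)

    // Set the option on the Hadoop Configuration, not via sqlContext.setConf.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

    val sqlContext = new SQLContext(sc)

    // Read the value back from the Hadoop Configuration to confirm it is in place.
    println(sc.hadoopConfiguration.get("spark.sql.parquet.output.committer.class"))

    sc.stop()
  }
}
```

Parquet writes issued through `sqlContext` after this point would be expected to use the configured committer, assuming the class is on the classpath of the Spark build in use.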