author: Cheng Lian <lian@databricks.com> 2015-06-23 17:24:26 -0700
committer: Cheng Lian <lian@databricks.com> 2015-06-23 17:24:26 -0700
commit: 111d6b9b8a584b962b6ae80c7aa8c45845ce0099
tree: bc5955310ec43cb175ea77a147fc3bd99340e27b /docs/sql-programming-guide.md
parent: 7fb5ae5024284593204779ff463bfbdb4d1c6da5
[SPARK-8139] [SQL] Updates docs and comments of data sources and Parquet output committer options
This PR only applies to master branch (1.5.0-SNAPSHOT) since it references `org.apache.parquet` classes which only appear in Parquet 1.7.0.

Author: Cheng Lian <lian@databricks.com>

Closes #6683 from liancheng/output-committer-docs and squashes the following commits:

b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r-- docs/sql-programming-guide.md 30
1 files changed, 29 insertions, 1 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 9107c9b676..2786e3d2cd 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1348,6 +1348,34 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
support.
</td>
</tr>
+<tr>
+ <td><code>spark.sql.parquet.output.committer.class</code></td>
+ <td><code>org.apache.parquet.hadoop.<br />ParquetOutputCommitter</code></td>
+ <td>
+ <p>
+ The output committer class used by Parquet. The specified class needs to be a subclass of
+ <code>org.apache.hadoop.<br />mapreduce.OutputCommitter</code>. Typically, it's also a
+ subclass of <code>org.apache.parquet.hadoop.ParquetOutputCommitter</code>.
+ </p>
+ <p>
+ <b>Note:</b>
+ <ul>
+ <li>
+ This option must be set via Hadoop <code>Configuration</code> rather than Spark
+ <code>SQLConf</code>.
+ </li>
+ <li>
+ This option overrides <code>spark.sql.sources.<br />outputCommitterClass</code>.
+ </li>
+ </ul>
+ </p>
+ <p>
+ Spark SQL comes with a built-in
+ <code>org.apache.spark.sql.<br />parquet.DirectParquetOutputCommitter</code>, which can be more
+ efficient than the default Parquet output committer when writing data to S3.
+ </p>
+ </td>
+</tr>
</table>
## JSON Datasets
@@ -1876,7 +1904,7 @@ that these options will be deprecated in future release as more optimizations ar
Configures the number of partitions to use when shuffling data for joins or aggregations.
</td>
</tr>
- <tr>
+ <tr>
<td><code>spark.sql.planner.externalSort</code></td>
<td>false</td>
<td>