author    Reynold Xin <rxin@databricks.com>    2016-04-07 00:51:45 -0700
committer Reynold Xin <rxin@databricks.com>    2016-04-07 00:51:45 -0700
commit    9ca0760d6769199f164a661655912f028234eb1c (patch)
tree      9077fffc4e74921b25bc15b2e41150f98ac7000c /docs/sql-programming-guide.md
parent    e11aa9ec5c3cdcd8ca08d2486a7208840ad77bf8 (diff)
[SPARK-10063][SQL] Remove DirectParquetOutputCommitter
## What changes were proposed in this pull request?

This patch removes DirectParquetOutputCommitter. It was originally created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer is only correct when there are no failures. If the same task runs more than once (e.g. due to speculation, task failures, or node failures), the output data can be corrupted. This performance optimization does not outweigh the correctness issue.

## How was this patch tested?

Removed the related tests as well.

Author: Reynold Xin <rxin@databricks.com>

Closes #12229 from rxin/SPARK-10063.
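For context, a minimal sketch of what the removed option controlled and what the behavior is after this patch. The `sqlContext`, `df`, and S3 path below are hypothetical placeholders; the config key and committer class names come from the documentation removed in the diff, which also notes that the option had to be set on the Hadoop `Configuration` rather than on `SQLConf`.

```scala
// Pre-patch sketch (Spark 1.x API): opting into the direct committer required
// setting the key on the Hadoop Configuration, not via SQLConf.
// `sqlContext` and `df` are assumed to already exist.
sqlContext.sparkContext.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// After this patch the option is gone: a plain Parquet write goes through the
// default org.apache.parquet.hadoop.ParquetOutputCommitter, which remains
// correct under speculation, task failures, and node failures.
df.write.parquet("s3a://bucket/path/table")
```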
Diffstat (limited to 'docs/sql-programming-guide.md')
-rw-r--r--  docs/sql-programming-guide.md  33
1 file changed, 0 insertions(+), 33 deletions(-)
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 274a8edb0c..63310be22c 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1467,37 +1467,6 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
</td>
</tr>
<tr>
- <td><code>spark.sql.parquet.output.committer.class</code></td>
- <td><code>org.apache.parquet.hadoop.<br />ParquetOutputCommitter</code></td>
- <td>
- <p>
- The output committer class used by Parquet. The specified class needs to be a subclass of
- <code>org.apache.hadoop.<br />mapreduce.OutputCommitter</code>. Typically, it's also a
- subclass of <code>org.apache.parquet.hadoop.ParquetOutputCommitter</code>.
- </p>
- <p>
- <b>Note:</b>
- <ul>
- <li>
- This option is automatically ignored if <code>spark.speculation</code> is turned on.
- </li>
- <li>
- This option must be set via Hadoop <code>Configuration</code> rather than Spark
- <code>SQLConf</code>.
- </li>
- <li>
- This option overrides <code>spark.sql.sources.<br />outputCommitterClass</code>.
- </li>
- </ul>
- </p>
- <p>
- Spark SQL comes with a builtin
- <code>org.apache.spark.sql.<br />parquet.DirectParquetOutputCommitter</code>, which can be more
- efficient then the default Parquet output committer when writing data to S3.
- </p>
- </td>
-</tr>
-<tr>
<td><code>spark.sql.parquet.mergeSchema</code></td>
<td><code>false</code></td>
<td>
@@ -2165,8 +2134,6 @@ options.
- In the `sql` dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains
unchanged.
- The canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
- - It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe
- and thus this output committer will not be used when speculation is on, independent of configuration.
- JSON data source will not automatically load new files that are created by other applications
(i.e. files that are not inserted to the dataset through Spark SQL).
For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore),