[SQL] More aggressive defaults

- Turns on compression for in-memory cached data by default
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
- Ups the batch size to 10,000 rows
- Increases the broadcast threshold to 10mb.
- Uses our parquet implementation instead of the hive one by default.
- Cache parquet metadata by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
author: Michael Armbrust <michael@databricks.com> 2014-11-03 14:08:27 -0800
committer: Michael Armbrust <michael@databricks.com> 2014-11-03 14:08:27 -0800
commit: 25bef7e6951301e93004567fc0cef96bf8d1a224 (patch)
tree: 73941695b30cb7cdf96c9805935697162c578b14 /docs
parent: e83f13e8d37ca33f4e183e977d077221b90c6025 (diff)
1 files changed, 13 insertions, 5 deletions
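
For anyone who needs the previous behaviour after picking up this commit, here is a minimal sketch that tabulates the documented defaults before and after the change and emits Spark SQL `SET key=value` statements to restore the old values. The old and new values are copied from the diff to `docs/sql-programming-guide.md`. The names `DEFAULTS` and `restore_statements` are hypothetical and not part of Spark. `spark.sql.hive.convertMetastoreParquet` is left out because the diff only shows its new default. The same values could presumably be passed to `setConf` on a `SQLContext` instead of being run as SQL statements.

```python
# Documented defaults before and after this commit, copied from the diff.
# Values are kept as strings because that is how they appear in the docs.
DEFAULTS = {
    # property name: (old default, new default)
    "spark.sql.parquet.cacheMetadata": ("false", "true"),
    "spark.sql.parquet.compression.codec": ("snappy", "gzip"),
    "spark.sql.inMemoryColumnarStorage.compressed": ("false", "true"),
    "spark.sql.inMemoryColumnarStorage.batchSize": ("1000", "10000"),
    # The new threshold is 10 MB expressed in bytes: 10 * 1024 * 1024.
    "spark.sql.autoBroadcastJoinThreshold": ("10000", str(10 * 1024 * 1024)),
}


def restore_statements(defaults):
    """Build one `SET key=value` statement per property whose default changed.

    Hypothetical helper for illustration; statements are sorted by property name.
    """
    return [
        f"SET {key}={old}"
        for key, (old, new) in sorted(defaults.items())
        if old != new
    ]


if __name__ == "__main__":
    for statement in restore_statements(DEFAULTS):
        print(statement)
```

Each printed line is intended to be run as a SQL statement in the session whose defaults should be reverted. Whether reverting is advisable depends on the workload, for example the OOM behaviour with Snappy noted in the commit message.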
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d4ade939c3..e399fecbbc 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -582,19 +582,27 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
 </tr>
 <tr>
   <td><code>spark.sql.parquet.cacheMetadata</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     Turns on caching of Parquet schema metadata.  Can speed up querying of static data.
   </td>
 </tr>
 <tr>
   <td><code>spark.sql.parquet.compression.codec</code></td>
-  <td>snappy</td>
+  <td>gzip</td>
   <td>
     Sets the compression codec use when writing Parquet files. Acceptable values include: 
     uncompressed, snappy, gzip, lzo.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.hive.convertMetastoreParquet</code></td>
+  <td>true</td>
+  <td>
+    When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in
+    support.
+  </td>
+</tr>
 </table>
 
 ## JSON Datasets
@@ -815,7 +823,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     When set to true Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
@@ -823,7 +831,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 </tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
-  <td>1000</td>
+  <td>10000</td>
   <td>
     Controls the size of batches for columnar caching.  Larger batch sizes can improve memory utilization
     and compression, but risk OOMs when caching data.
@@ -841,7 +849,7 @@ that these options will be deprecated in future release as more optimizations ar
   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
   <tr>
     <td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
-    <td>10000</td>
+    <td>10485760 (10 MB)</td>
     <td>
       Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
       performing a join.  By setting this value to -1 broadcasting can be disabled.  Note that currently