author | Michael Armbrust <michael@databricks.com> | 2014-11-03 14:08:27 -0800 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2014-11-03 14:08:27 -0800 |
commit | 25bef7e6951301e93004567fc0cef96bf8d1a224 (patch) | |
tree | 73941695b30cb7cdf96c9805935697162c578b14 /docs | |
parent | e83f13e8d37ca33f4e183e977d077221b90c6025 (diff) | |
[SQL] More aggressive defaults
- Turns on compression for in-memory cached data by default.
- Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory).
- Ups the batch size to 10,000 rows.
- Increases the broadcast threshold to 10 MB.
- Uses our parquet implementation instead of the hive one by default.
- Caches parquet metadata by default.
Author: Michael Armbrust <michael@databricks.com>
Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:
97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults
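
For anyone who needs the earlier behaviour, the following is a minimal sketch that collects the pre-commit defaults shown in the docs diff for this commit. The property names and values are taken from the diff; the names `DEFAULTS` and `old_defaults` are hypothetical and exist only for this illustration. `spark.sql.hive.convertMetastoreParquet` is left out because the diff documents it for the first time and shows no earlier value.

```python
# (old default, new default) pairs as they appear in the docs diff of this commit.
# DEFAULTS and old_defaults are illustrative names, not part of Spark.
DEFAULTS = {
    "spark.sql.parquet.cacheMetadata": ("false", "true"),
    "spark.sql.parquet.compression.codec": ("snappy", "gzip"),
    "spark.sql.inMemoryColumnarStorage.compressed": ("false", "true"),
    "spark.sql.inMemoryColumnarStorage.batchSize": ("1000", "10000"),
    "spark.sql.autoBroadcastJoinThreshold": ("10000", "10485760"),
}


def old_defaults(defaults=DEFAULTS):
    """Return a dict mapping each property to its pre-commit default."""
    return {key: old for key, (old, _new) in defaults.items()}


if __name__ == "__main__":
    # Print the settings in a stable order for review.
    for key, value in sorted(old_defaults().items()):
        print("{} = {}".format(key, value))
```

Each pair could then be applied with the `setConf` method on SQLContext that the guide mentions, for example `sqlContext.setConf(key, value)`, assuming a `sqlContext` already exists in the session.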
Diffstat (limited to 'docs')
-rw-r--r-- | docs/sql-programming-guide.md | 18 |
1 files changed, 13 insertions, 5 deletions
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d4ade939c3..e399fecbbc 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -582,19 +582,27 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
 </tr>
 <tr>
   <td><code>spark.sql.parquet.cacheMetadata</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     Turns on caching of Parquet schema metadata. Can speed up querying of static data.
   </td>
 </tr>
 <tr>
   <td><code>spark.sql.parquet.compression.codec</code></td>
-  <td>snappy</td>
+  <td>gzip</td>
   <td>
     Sets the compression codec use when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.hive.convertMetastoreParquet</code></td>
+  <td>true</td>
+  <td>
+    When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in
+    support.
+  </td>
+</tr>
 </table>
 
 ## JSON Datasets
@@ -815,7 +823,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     When set to true Spark SQL will automatically select a compression codec for each column based on statistics of the data.
@@ -823,7 +831,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
   </td>
 </tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
-  <td>1000</td>
+  <td>10000</td>
   <td>
     Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.
@@ -841,7 +849,7 @@ that these options will be deprecated in future release as more optimizations ar
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
-  <td>10000</td>
+  <td>10485760 (10 MB)</td>
   <td>
     Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
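
To put the `spark.sql.autoBroadcastJoinThreshold` change in perspective, here is a small sketch of the arithmetic behind the new value and of what the threshold means for a table of a given size. It is a simplified model under stated assumptions: `would_broadcast` is a hypothetical helper, not a Spark API, and it ignores how Spark actually estimates table sizes.

```python
# Old and new defaults from the diff, in bytes.
OLD_THRESHOLD_BYTES = 10000
NEW_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB expressed in bytes


def would_broadcast(table_size_bytes, threshold_bytes):
    """Hypothetical model: a table is broadcast if it fits under the threshold.

    A threshold of -1 disables broadcasting, as the property description states.
    """
    if threshold_bytes < 0:
        return False
    return table_size_bytes <= threshold_bytes


if __name__ == "__main__":
    five_mb = 5 * 1024 * 1024
    print(NEW_THRESHOLD_BYTES)
    print(would_broadcast(five_mb, OLD_THRESHOLD_BYTES))
    print(would_broadcast(five_mb, NEW_THRESHOLD_BYTES))
```

Under this model a 5 MB table falls under the new threshold but not the old one, which is the kind of join the larger default is meant to cover.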