aboutsummaryrefslogtreecommitdiff
path: root/docs/running-on-mesos.md
diff options
context:
space:
mode:
authorAndre Schumacher <andre.schumacher@iki.fi>2014-05-16 13:41:41 -0700
committerReynold Xin <rxin@apache.org>2014-05-16 13:41:41 -0700
commit40d6acd6ba2feccc600301f5c47d4f90157138b1 (patch)
treef7575f51808dd473f41329c46f3122ce71140167 /docs/running-on-mesos.md
parent032d6632ad4ab88c97c9e568b63169a114220a02 (diff)
downloadspark-40d6acd6ba2feccc600301f5c47d4f90157138b1.tar.gz
spark-40d6acd6ba2feccc600301f5c47d4f90157138b1.tar.bz2
spark-40d6acd6ba2feccc600301f5c47d4f90157138b1.zip
SPARK-1487 [SQL] Support record filtering via predicate pushdown in Parquet
Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other one a NamedExpression are now pushed down to the underlying ParquetTableScan. Here are some results for a microbenchmark with a simple schema of six fields of different types where most records failed the test: | Uncompressed | Compressed -------------| ------------- | ------------- File size | 10 GB | 2 GB Speedup | 2 | 1.8 Since mileage may vary I added a new option to SparkConf: `org.apache.spark.sql.parquet.filter.pushdown` Default value would be `true` and setting it to `false` disables the pushdown. When most rows are expected to pass the filter or when there are few fields performance can be better when pushdown is disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter. Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371])) currently only predicates on non-nullable attributes are pushed down. If one would know that for a given table no optional fields have missing values one could also allow overriding this. Author: Andre Schumacher <andre.schumacher@iki.fi> Closes #511 from AndreSchumacher/parquet_filter and squashes the following commits: 16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase 7b304ca [Andre Schumacher] Fixing formatting c36d5cb [Andre Schumacher] Scalastyle 3da98db [Andre Schumacher] Second round of review feedback 7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter a86553b [Andre Schumacher] First round of code review feedback b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData 85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR 210e9cb [Andre Schumacher] Adding disjunctive filter predicates a93a588 [Andre Schumacher] Adding unit test for filtering 6d22666 [Andre Schumacher] Extending ParquetFilters 93e8192 [Andre Schumacher] First commit Parquet record filtering
Diffstat (limited to 'docs/running-on-mesos.md')
0 files changed, 0 insertions, 0 deletions