[SPARK-10216][SQL] Avoid creating empty files during overwriting with group by query - spark


author	hyukjinkwon <gurwls223@gmail.com>	2016-05-17 11:18:51 -0700
committer	Michael Armbrust <michael@databricks.com>	2016-05-17 11:18:51 -0700
commit	8d05a7a98bdbd3ce7c81d273e05a375877ebe68f (patch)
tree	07ef7adec405e8249f185fe637ec9ebb6a3c2421 /docs/ml-features.md
parent	20a89478e168cb6901ef89f4cb6aa79193ed244a (diff)
download	spark-8d05a7a98bdbd3ce7c81d273e05a375877ebe68f.tar.gz spark-8d05a7a98bdbd3ce7c81d273e05a375877ebe68f.tar.bz2 spark-8d05a7a98bdbd3ce7c81d273e05a375877ebe68f.zip


## What changes were proposed in this pull request?

Currently, `INSERT INTO` with a `GROUP BY` query tries to create at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in many empty files.

This PR avoids creating empty files when overwriting a Hive table or writing to the internal data sources with a `GROUP BY` query.

It checks whether a given partition contains any data and creates and writes a file only when it does.

## How was this patch tested?

Unit tests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Keuntae Park <sirpkt@apache.org>

Closes #12855 from HyukjinKwon/pr/8411.
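
The check described in the commit message (create a file only when the partition actually has data) can be illustrated with a minimal sketch in plain Python. This is not the code from the patch and does not use Spark; `write_partition`, `write_all`, and the `part-NNNNN` file naming are hypothetical names chosen for illustration. The idea is to peek at the first row of each partition before opening any output file.

```python
import os
import tempfile
from typing import Iterable, Iterator, List, Optional


def write_partition(rows: Iterable[str], path: str) -> Optional[str]:
    """Write rows to `path`, creating the file only if there is at least one row.

    Hypothetical helper: returns the path when a file was written, else None.
    """
    it: Iterator[str] = iter(rows)
    try:
        first = next(it)  # peek at the first row before touching the filesystem
    except StopIteration:
        return None  # empty partition: no file is created
    with open(path, "w") as f:
        f.write(first + "\n")
        for row in it:
            f.write(row + "\n")
    return path


def write_all(partitions: List[List[str]], out_dir: str) -> List[str]:
    """Write each non-empty partition to its own file and return the paths written."""
    written = []
    for i, part in enumerate(partitions):
        path = write_partition(part, os.path.join(out_dir, f"part-{i:05d}"))
        if path is not None:
            written.append(path)
    return written


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    # Four partitions, two of them empty (as happens after a shuffle with few groups).
    print(write_all([["a", "b"], [], ["c"], []], out_dir))
```

With four partitions of which two are empty, only two files end up in the output directory, instead of one file per partition.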

Diffstat (limited to 'docs/ml-features.md')

0 files changed, 0 insertions, 0 deletions

