path: root/docs/building-spark.md
author: Reynold Xin <rxin@databricks.com> 2016-10-20 12:18:56 -0700
committer: Reynold Xin <rxin@databricks.com> 2016-10-20 12:18:56 -0700
commit: 7f9ec19eae60abe589ffd22259a9065e7e353a57
tree: 304e751a63b5ec83ec4e8fa918573020890f2ae5 /docs/building-spark.md
parent: 947f4f25273161dc4719419a35613a71c2e2a150
[SPARK-18021][SQL] Refactor file name specification for data sources
## What changes were proposed in this pull request?

Currently each data source OutputWriter is responsible for specifying the entire file name for each file output. This, however, does not make any sense because we rely on file naming schemes for certain behaviors in Spark SQL, e.g. bucket id. The current approach allows individual data sources to break the implementation of bucketing.

On the flip side, we also don't want to move file naming entirely out of data sources, because different data sources do want to specify different extensions.

This patch divides file name specification into two parts: the first part is a prefix specified by the caller of OutputWriter (in WriteOutput), and the second part is the suffix that can be specified by the OutputWriter itself.

Note that a side effect of this change is that now all file based data sources also support bucketing automatically.

There are also some other minor cleanups:

- Removed the UUID passed through generic Configuration string
- Some minor rewrites for better clarity
- Renamed "path" in multiple places to "stagingDir", to more accurately reflect its meaning

## How was this patch tested?

This should be covered by existing data source tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #15562 from rxin/SPARK-18021.
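
The prefix/suffix split described in this commit message can be illustrated with a small, language-neutral sketch. This is not Spark's code: `build_prefix`, `CsvWriter`, `file_suffix`, `output_path`, `bucket_id_from_path`, and the exact `part-<task>-<uuid>_<bucket>` layout are all hypothetical names and formats chosen for illustration. The point it tries to show is that once the caller owns the prefix, the bucket id can be recovered from any file name no matter which extension the writer appends.

```python
import re
from typing import Optional


def build_prefix(staging_dir: str, task_id: int, job_uuid: str,
                 bucket_id: Optional[int] = None) -> str:
    """Caller-owned part of the name (hypothetical layout, not Spark's exact one)."""
    # The bucket id is encoded by the caller, so no data source can omit or mangle it.
    bucket_part = "" if bucket_id is None else f"_{bucket_id:05d}"
    return f"{staging_dir}/part-{task_id:05d}-{job_uuid}{bucket_part}"


class CsvWriter:
    """Stand-in for a data source writer; it only chooses the suffix."""

    def file_suffix(self) -> str:
        return ".csv"


def output_path(prefix: str, writer: CsvWriter) -> str:
    # Full path = caller-specified prefix + writer-specified suffix.
    return prefix + writer.file_suffix()


def bucket_id_from_path(path: str) -> Optional[int]:
    """Recover the bucket id from a file name, ignoring whatever suffix follows."""
    name = path.rsplit("/", 1)[-1]
    # Assumes the job UUID contains no underscore, as in this sketch.
    match = re.search(r"_(\d+)(?:\..*)?$", name)
    return int(match.group(1)) if match else None


bucketed = output_path(build_prefix("/tmp/stage", 3, "abc", bucket_id=7), CsvWriter())
plain = output_path(build_prefix("/tmp/stage", 3, "abc"), CsvWriter())
print(bucketed, bucket_id_from_path(bucketed))
print(plain, bucket_id_from_path(plain))
```

Because the suffix is the only piece a writer controls, adding a new file-based format in this sketch means supplying a different `file_suffix`, and bucket-id parsing keeps working unchanged.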
Diffstat (limited to 'docs/building-spark.md')
0 files changed, 0 insertions, 0 deletions