author    Tathagata Das <tathagata.das1565@gmail.com>    2016-04-22 17:17:37 -0700
committer Shixiong Zhu <shixiong@databricks.com>         2016-04-22 17:17:37 -0700
commit    c431a76d0628985bb445189b9a2913dd41b86f7b (patch)
tree      723bae4814ffc9e935ee761d7b186aa2b37bee9d /licenses/LICENSE-boto.txt
parent    c25b97fccee557c9247ad5bf006a83a55c5e0e32 (diff)
[SPARK-14832][SQL][STREAMING] Refactor DataSource to ensure schema is inferred only once when creating a file stream
## What changes were proposed in this pull request?

When creating a file stream using `sqlContext.read.stream()`, existing files are scanned twice to find the schema:

- once when creating a DataSource + StreamingRelation in `DataFrameReader.stream()`
- again when creating the streaming Source from the DataSource, in `DataSource.createSource()`

Instead, the schema should be inferred only once, at the time the dataframe is created; when the streaming source is created, it should simply reuse that schema.

The solution proposed in this PR is to add a lazy field in DataSource that caches the schema. The streaming Source created by the DataSource can then reuse the cached schema.

## How was this patch tested?

Refactored unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12591 from tdas/SPARK-14832.
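The caching approach described above can be sketched with a `lazy val`, which Scala evaluates at most once on first access. This is a simplified illustration, not Spark's actual `DataSource` code; `StructType`, `inferSchemaFromFiles`, and the class name are hypothetical stand-ins:

```scala
// Minimal sketch: cache an expensively-inferred schema behind a lazy val so
// that creating the streaming source does not trigger a second file scan.
// StructType and inferSchemaFromFiles are stand-ins for the real Spark types.
case class StructType(fields: Seq[String])

class DataSourceSketch(paths: Seq[String]) {
  private var scanCount = 0 // tracks how many times files were scanned

  // Stand-in for the expensive schema inference that lists and reads files.
  private def inferSchemaFromFiles(): StructType = {
    scanCount += 1
    StructType(paths.map(p => s"col_$p"))
  }

  // Computed once on first access; every later caller reuses the result.
  lazy val sourceSchema: StructType = inferSchemaFromFiles()

  // The streaming source just reuses the cached schema instead of re-scanning.
  def createSource(): StructType = sourceSchema

  def scans: Int = scanCount
}

object Demo extends App {
  val ds = new DataSourceSketch(Seq("a", "b"))
  val s1 = ds.sourceSchema   // first access: triggers inference
  val s2 = ds.createSource() // reuses cached schema, no second scan
  assert(s1 == s2)
  assert(ds.scans == 1)
  println(ds.scans)
}
```

A `lazy val` also gives thread-safe one-time initialization for free, which matters if the dataframe and the streaming source are constructed from different threads.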
Diffstat (limited to 'licenses/LICENSE-boto.txt')
0 files changed, 0 insertions, 0 deletions