author    | Tathagata Das <tathagata.das1565@gmail.com> | 2016-04-22 17:17:37 -0700
committer | Shixiong Zhu <shixiong@databricks.com> | 2016-04-22 17:17:37 -0700
commit    | c431a76d0628985bb445189b9a2913dd41b86f7b
tree      | 723bae4814ffc9e935ee761d7b186aa2b37bee9d /licenses/LICENSE-boto.txt
parent    | c25b97fccee557c9247ad5bf006a83a55c5e0e32
[SPARK-14832][SQL][STREAMING] Refactor DataSource to ensure schema is inferred only once when creating a file stream
## What changes were proposed in this pull request?
When creating a file stream using sqlContext.read.stream(), existing files are scanned twice to infer the schema:
- once when creating the DataSource + StreamingRelation in DataFrameReader.stream(), and
- again when creating the streaming Source from the DataSource, in DataSource.createSource().

Instead, the schema should be inferred only once, at the time the DataFrame is created; when the streaming source is later created, it should simply reuse that schema.

The solution proposed in this PR is to add a lazy field in DataSource that caches the inferred schema. The streaming Source created by the DataSource can then reuse the cached schema instead of re-scanning the files.
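The caching idea can be sketched with a minimal, self-contained Scala example. This is an illustration of the lazy-field pattern only, not Spark's actual DataSource code: the names SimpleDataSource, inferSchema, and inferenceCount are hypothetical, and the schema types are stubbed out.

```scala
// Hypothetical sketch of schema caching via a lazy val.
// None of these names are Spark's real API; they only illustrate the pattern.
object SchemaCachingSketch {
  final case class StructField(name: String, dataType: String)
  final case class StructType(fields: Seq[StructField])

  class SimpleDataSource(paths: Seq[String]) {
    // Counts how many times inference actually ran (for demonstration).
    var inferenceCount = 0

    // Expensive step: in Spark this would scan all existing files.
    private def inferSchema(): StructType = {
      inferenceCount += 1
      StructType(Seq(StructField("value", "string")))
    }

    // Lazy field: the first access triggers inference exactly once;
    // every later access returns the cached result.
    lazy val sourceSchema: StructType = inferSchema()

    // The streaming Source reuses the cached schema rather than re-inferring.
    def createSource(): StructType = sourceSchema
  }
}
```

Accessing `sourceSchema` while building the relation and again inside `createSource()` leaves `inferenceCount` at 1, which is the behavior the refactoring aims for.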
## How was this patch tested?
Refactored unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #12591 from tdas/SPARK-14832.
Diffstat (limited to 'licenses/LICENSE-boto.txt')
0 files changed, 0 insertions, 0 deletions