author    | Tathagata Das <tathagata.das1565@gmail.com> | 2016-04-22 17:17:37 -0700
committer | Shixiong Zhu <shixiong@databricks.com> | 2016-04-22 17:17:37 -0700
commit    | c431a76d0628985bb445189b9a2913dd41b86f7b
tree      | 723bae4814ffc9e935ee761d7b186aa2b37bee9d /licenses/LICENSE-boto.txt
parent    | c25b97fccee557c9247ad5bf006a83a55c5e0e32
[SPARK-14832][SQL][STREAMING] Refactor DataSource to ensure schema is inferred only once when creating a file stream
## What changes were proposed in this pull request?
When creating a file stream using sqlContext.read.stream(), existing files are scanned twice to infer the schema:
- once when creating the DataSource + StreamingRelation in DataFrameReader.stream(), and
- again when creating the streaming Source from the DataSource, in DataSource.createSource().

Instead, the schema should be inferred only once, at the time the DataFrame is created; when the streaming source is later created, it should simply reuse that schema.

The solution proposed in this PR is to add a lazy field in DataSource that caches the inferred schema. The streaming Source created by the DataSource can then reuse the cached schema instead of re-scanning the files.
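The caching idea can be sketched with a minimal, self-contained Scala example. This is an illustration of the lazy-field pattern only, not Spark's actual DataSource code: the names SimpleDataSource, inferSchema, and inferenceCount are hypothetical, and the schema types are stubbed out.

```scala
// Hypothetical sketch of schema caching via a lazy val.
// None of these names are Spark's real API; they only illustrate the pattern.
object SchemaCachingSketch {
  final case class StructField(name: String, dataType: String)
  final case class StructType(fields: Seq[StructField])

  class SimpleDataSource(paths: Seq[String]) {
    // Counts how many times inference actually ran (for demonstration).
    var inferenceCount = 0

    // Expensive step: in Spark this would scan all existing files.
    private def inferSchema(): StructType = {
      inferenceCount += 1
      StructType(Seq(StructField("value", "string")))
    }

    // Lazy field: the first access triggers inference exactly once;
    // every later access returns the cached result.
    lazy val sourceSchema: StructType = inferSchema()

    // The streaming Source reuses the cached schema rather than re-inferring.
    def createSource(): StructType = sourceSchema
  }
}
```

Accessing `sourceSchema` while building the relation and again inside `createSource()` leaves `inferenceCount` at 1, which is the behavior the refactoring aims for.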
## How was this patch tested?
Refactored unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #12591 from tdas/SPARK-14832.
Diffstat (limited to 'licenses/LICENSE-boto.txt')
0 files changed, 0 insertions, 0 deletions