path: root/pom.xml
author: Josh Rosen <joshrosen@databricks.com> 2016-12-02 21:14:34 -0800
committer: Reynold Xin <rxin@databricks.com> 2016-12-02 21:14:34 -0800
commit: 7c33b0fd050f3d2b08c1cfd7efbff8166832c1af
tree: 6f60911a5b8374e8a0b16f095ff7466266940bf5 /pom.xml
parent: c7c7265950945a1b14165365600bdbfd540cf522
[SPARK-18362][SQL] Use TextFileFormat in implementation of CSVFileFormat
## What changes were proposed in this pull request?

This patch significantly improves the IO / file listing performance of schema inference in Spark's built-in CSV data source.

Previously, this data source used the legacy `SparkContext.hadoopFile` and `SparkContext.hadoopRDD` methods to read files during its schema inference step, causing huge file-listing bottlenecks on the driver.

This patch refactors this logic to use Spark SQL's `text` data source to read files during this step. The text data source still performs some unnecessary file listing (since in theory we already have resolved the table prior to schema inference and therefore should be able to scan without performing _any_ extra listing), but that listing is much faster and takes place in parallel. In one production workload operating over tens of thousands of files, this change managed to reduce schema inference time from 7 minutes to 2 minutes.

A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.

## How was this patch tested?

Existing unit tests, plus manual benchmarking on a production workload.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15813 from JoshRosen/use-text-data-source-in-csv-and-json.
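
The patch itself is not shown on this page. As a rough illustration of why a plain text source is enough for the inference step, note that schema inference only needs each file's contents as lines of strings, so any reader that yields lines can feed it. The sketch below is Spark-independent Python. The names `infer_field_type`, `merge_types`, and `infer_schema` are hypothetical, and the three-type lattice is a simplifying assumption, not Spark's actual inference rules.

```python
import csv
from typing import Iterable, List

# Assumed, simplified type lattice, ordered from narrowest to widest.
_TYPE_ORDER = ["int", "double", "string"]


def infer_field_type(value: str) -> str:
    """Return the narrowest type name that can represent one CSV field."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def merge_types(a: str, b: str) -> str:
    """Widen two inferred types to the broader of the two."""
    return max(a, b, key=_TYPE_ORDER.index)


def infer_schema(lines: Iterable[str]) -> List[str]:
    """Infer one type per column from raw CSV lines (no header row).

    Assumes every non-empty row has the same number of columns.
    """
    schema: List[str] = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        types = [infer_field_type(field) for field in row]
        if not schema:
            schema = types
        else:
            schema = [merge_types(a, b) for a, b in zip(schema, types)]
    return schema
```

Because `infer_schema` takes only an iterable of strings, how those lines are produced (and how the underlying files are listed) can be swapped out without touching the inference logic, which is the kind of separation the description above relies on.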
Diffstat (limited to 'pom.xml')
0 files changed, 0 insertions, 0 deletions