path: root/python/pyspark/sql/conf.py
author: Tathagata Das <tathagata.das1565@gmail.com> 2016-05-04 11:02:48 -0700
committer: Tathagata Das <tathagata.das1565@gmail.com> 2016-05-04 11:02:48 -0700
commit: 0fd3a4748416233f034ec137d95f0a4c8712d396
tree: 6c370ad0188f01d2d2b9fa9f232791a1743fc6cc /python/pyspark/sql/conf.py
parent: 6274a520fa743b7d079fde4a3033da5c3a2532a1
[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning

## What changes were proposed in this pull request?

File Stream Sink writes the list of written files to a metadata log. StreamFileCatalog reads that list of files for processing. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does. This PR enables that by refactoring HDFSFileCatalog to create an abstract class, PartitioningAwareFileCatalog, that has all the functionality needed to infer partitions from a list of leaf files.

- HDFSFileCatalog has been renamed to ListingFileCatalog, and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog, and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files, as they are not interfaces that belong in fileSourceInterfaces.scala.

## How was this patch tested?

- FileStreamSinkSuite was updated to check that partitioning gets inferred and that, on reading, partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12879 from tdas/SPARK-15103.
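
The shape of the refactoring described above can be illustrated with a small Python sketch. This is not Spark's code (the real classes are Scala and considerably more involved); the class names `PartitionAwareCatalog`, `ListingCatalog`, and `MetadataLogCatalog` and the methods `leaf_files` and `partitions` are hypothetical stand-ins, and partition inference is assumed to mean reading `key=value` directory names from each leaf file's path.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, List


class PartitionAwareCatalog(ABC):
    """Hypothetical stand-in for the abstract base class in the commit message."""

    @abstractmethod
    def leaf_files(self) -> List[str]:
        """Return the paths of all data files; each subclass decides how."""

    def partitions(self) -> Dict[str, Dict[str, str]]:
        """Infer key=value partition columns from each file's parent directories."""
        result = {}
        for path in self.leaf_files():
            # Drop the file name and keep only the directory components.
            dirs = path.replace(os.sep, "/").split("/")[:-1]
            result[path] = dict(d.split("=", 1) for d in dirs if "=" in d)
        return result


class ListingCatalog(PartitionAwareCatalog):
    """Finds leaf files by recursively scanning a directory (assumed behavior)."""

    def __init__(self, root: str) -> None:
        self.root = root

    def leaf_files(self) -> List[str]:
        return sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(self.root)
            for name in names
        )


class MetadataLogCatalog(PartitionAwareCatalog):
    """Takes leaf files from an already-recorded list, such as a sink's log."""

    def __init__(self, logged_paths: List[str]) -> None:
        self.logged_paths = list(logged_paths)

    def leaf_files(self) -> List[str]:
        return list(self.logged_paths)
```

Because both subclasses only supply `leaf_files`, the inference logic lives in one place, which is the point of the refactoring: a catalog backed by a metadata log gets the same partition inference as one backed by a directory listing.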
Diffstat (limited to 'python/pyspark/sql/conf.py')
0 files changed, 0 insertions, 0 deletions