author | Cheng Lian <lian@databricks.com> | 2015-05-13 11:04:10 -0700 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2015-05-13 11:04:10 -0700 |
commit | 7ff16e8abef9fbf4a4855e23c256b22e62e560a6 (patch) | |
tree | 1be1249ecb9db02ef5bf8820f7c44a7fbe71a6ff /project | |
parent | bec938f777a2e18757c7d04504d86a5342e2b49e (diff) | |
[SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelation
This PR migrates the Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` replaces `ParquetRelation2`. The major differences are:
1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is no longer used. Instead, an anonymous subclass of `ParquetOutputFormat` handles appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push-down, so it no longer extends `CatalystScan`
After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.
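To make the first and third points more concrete, here is a minimal, self-contained sketch of the general idea: partition values are derived from `key=value` directory names, and a scan then touches only the files of one selected partition. It is an illustration under assumptions, not the code in this PR: `PartitionDiscoverySketch`, `parsePartition`, `discover`, and `filesForPartition` are hypothetical names, values are kept as raw strings, and plain path strings stand in for Hadoop file statuses and `RDD[Row]`.

```scala
// Hypothetical, simplified illustration; NOT Spark's actual partition discovery code.
object PartitionDiscoverySketch {
  /** Partition column name -> raw string value, in directory order. */
  type PartitionValues = List[(String, String)]

  /** Extracts `key=value` directory segments from a slash-separated file path. */
  def parsePartition(path: String): PartitionValues =
    path.split("/").toList
      .dropRight(1) // the last segment is the file name, not a partition directory
      .flatMap { segment =>
        segment.split("=", 2) match {
          case Array(key, value) if key.nonEmpty => Some(key -> value)
          case _                                 => None // ordinary directory
        }
      }

  /** Groups leaf files by the partition values encoded in their paths. */
  def discover(files: Seq[String]): Map[PartitionValues, Seq[String]] =
    files.groupBy(parsePartition)

  /** Stand-in for a per-partition scan: returns only the selected partition's files. */
  def filesForPartition(
      discovered: Map[PartitionValues, Seq[String]],
      selected: PartitionValues): Seq[String] =
    discovered.getOrElse(selected, Seq.empty)
}
```

With a layout such as `/data/t/year=2015/month=5/part-0.parquet`, `discover` groups the files under the key `List("year" -> "2015", "month" -> "5")`, and `filesForPartition` returns only that group, so pruning other partitions happens before any file is read.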
Author: Cheng Lian <lian@databricks.com>
Closes #6090 from liancheng/parquet-migration and squashes the following commits:
6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommitter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
Diffstat (limited to 'project')
-rw-r--r-- | project/MimaExcludes.scala | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index a47e29e2ef..f31f0e554e 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -111,6 +111,12 @@ object MimaExcludes {
               "org.apache.spark.sql.parquet.ParquetRelation2$PartitionValues"),
             ProblemFilters.exclude[MissingClassProblem](
               "org.apache.spark.sql.parquet.ParquetRelation2$PartitionValues$"),
+            ProblemFilters.exclude[MissingClassProblem](
+              "org.apache.spark.sql.parquet.ParquetRelation2"),
+            ProblemFilters.exclude[MissingClassProblem](
+              "org.apache.spark.sql.parquet.ParquetRelation2$"),
+            ProblemFilters.exclude[MissingClassProblem](
+              "org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache"),
             // These test support classes were moved out of src/main and into src/test:
             ProblemFilters.exclude[MissingClassProblem](
               "org.apache.spark.sql.parquet.ParquetTestData"),
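
The three added exclusions name `ParquetRelation2`, `ParquetRelation2$`, and `ParquetRelation2$MetadataCache` separately because MiMa checks JVM class files, and Scala emits a distinct class file for a class, for its companion object (suffix `$`), and for each nested class (`Outer$Inner`). The sketch below shows that naming rule with hypothetical stand-in types. `NameManglingSketch` and `Relation` are not Spark classes, and where `MetadataCache` was actually declared inside `ParquetRelation2` is not shown in this diff.

```scala
// Hypothetical stand-ins that only demonstrate JVM naming; not the real Spark classes.
object NameManglingSketch {
  class Relation           // class file name ends in "Relation"
  object Relation {        // companion object: class file name ends in "Relation$"
    class MetadataCache    // nested class: name ends in "Relation$MetadataCache"
  }

  // Binary (JVM) names as seen by tools that inspect class files.
  val className: String     = classOf[Relation].getName
  val companionName: String = Relation.getClass.getName
  val nestedName: String    = classOf[Relation.MetadataCache].getName
}
```

Each of these binary names would need its own `MissingClassProblem` filter if the corresponding source definition were removed, which is consistent with the one-entry-per-class pattern in the diff.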