diff options
author | Cheng Lian <lian@databricks.com> | 2015-10-31 21:16:09 -0700 |
---|---|---|
committer | Davies Liu <davies.liu@gmail.com> | 2015-10-31 21:16:09 -0700 |
commit | aa494a9c2ebd59baec47beb434cd09bf3f188218 (patch) | |
tree | 54bd81e8b7540d08c27be3671bc250ec54020bbf /docs/_layouts | |
parent | 40d3c6797a3dfd037eb69b2bcd336d8544deddf5 (diff) | |
download | spark-aa494a9c2ebd59baec47beb434cd09bf3f188218.tar.gz spark-aa494a9c2ebd59baec47beb434cd09bf3f188218.tar.bz2 spark-aa494a9c2ebd59baec47beb434cd09bf3f188218.zip |
[SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow
This PR fixes two issues:
1. `PhysicalRDD.outputsUnsafeRows` is always `false`
Thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`.
1. Internal/external row conversion for `HadoopFsRelation` is kinda messy
Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external row or internal row and apply external-to-internal conversion when necessary. Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`.
This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s. In this way, now `HadoopFsRelation` always produces `UnsafeRow`s. Thus `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`.
A remaining question is that, can we assume that all non-builtin `HadoopFsRelation` data sources output external rows? At least all well known ones do so. However it's possible that some users implemented their own `HadoopFsRelation` data sources that leverages `InternalRow` and thus all those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and cleanup some more conversion code (like [here][2] and [here][3]).
This PR supersedes #9125.
Follow-ups:
1. Makes JSON and ORC data sources output `UnsafeRow` directly
1. Makes `HiveTableScan` output `UnsafeRow` directly
This is related to 1 since ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`.
[1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
[2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
[3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669
Author: Cheng Lian <lian@databricks.com>
Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.
Diffstat (limited to 'docs/_layouts')
0 files changed, 0 insertions, 0 deletions