author | Rajesh Balamohan <rbalamohan@apache.org> | 2016-04-22 22:51:40 -0700 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2016-04-22 22:51:40 -0700 |
commit | e5226e3007d6645c6d48d3c1b2762566184f3fc7 (patch) | |
tree | 3fc4e6ea679ee8fb87d7d1aa97da2462199ecc4b /mllib | |
parent | 95faa731c15ce2e36373071a405207165818df97 (diff) | |
[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation
## What changes were proposed in this pull request?
When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, OrcRelation.unwrapOrcStructs ends up reading the file information again to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional NN call, especially for partitioned datasets.
Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but with the option of exposing the ObjectInspector. This eliminates the need to look up the file later to generate the object inspector, which is especially useful for partitioned tables/datasets.
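
The idea can be illustrated with a minimal, dependency-free sketch. It is not the code from this patch: `OrcReaderSketch`, `FakeFileSystem`, and `InspectorExposingReader` are hypothetical names, the schema string stands in for an ObjectInspector, and a simple counter stands in for NameNode round trips. It only shows why a reader that keeps the inspector it obtained at open time saves one lookup per file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these names come from Spark, Hive, or Hadoop.
final class OrcReaderSketch {

    /** Counts metadata lookups, standing in for NameNode round trips. */
    static final class FakeFileSystem {
        private final Map<String, String> schemas = new HashMap<>();
        private int metadataCalls = 0;

        void addFile(String path, String schema) {
            schemas.put(path, schema);
        }

        /** Each call models one NameNode request needed to open a file. */
        String openAndReadSchema(String path) {
            metadataCalls++;
            String schema = schemas.get(path);
            if (schema == null) {
                throw new IllegalArgumentException("no such file: " + path);
            }
            return schema;
        }

        int metadataCalls() {
            return metadataCalls;
        }
    }

    /** Reader that keeps the inspector (here: a schema string) obtained while opening the file. */
    static final class InspectorExposingReader {
        private final String inspector;

        InspectorExposingReader(FakeFileSystem fs, String path) {
            // The only lookup this reader ever performs for the file.
            this.inspector = fs.openAndReadSchema(path);
        }

        String getObjectInspector() {
            return inspector;
        }
    }

    /** Earlier flow: open the reader, then look the file up again just for the inspector. */
    static int callsWithSeparateLookup(FakeFileSystem fs, String path) {
        int before = fs.metadataCalls();
        new InspectorExposingReader(fs, path); // opening the reader
        fs.openAndReadSchema(path);            // second lookup for the inspector
        return fs.metadataCalls() - before;
    }

    /** Proposed flow: reuse the inspector the reader already holds. */
    static int callsWithExposedInspector(FakeFileSystem fs, String path) {
        int before = fs.metadataCalls();
        InspectorExposingReader reader = new InspectorExposingReader(fs, path);
        reader.getObjectInspector();           // no further lookup
        return fs.metadataCalls() - before;
    }

    public static void main(String[] args) {
        FakeFileSystem fs = new FakeFileSystem();
        fs.addFile("/table/p=1/part-0.orc", "struct<a:int>");
        System.out.println("separate lookup:   "
                + callsWithSeparateLookup(fs, "/table/p=1/part-0.orc"));
        System.out.println("exposed inspector: "
                + callsWithExposedInspector(fs, "/table/p=1/part-0.orc"));
    }
}
```

In this toy model the saving is one lookup per file, so it grows with the number of files read, which is consistent with the motivation given above for partitioned datasets.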
## How was this patch tested?
Ran TPC-DS queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
…SourceStrategy mode
Author: Rajesh Balamohan <rbalamohan@apache.org>
Closes #12319 from rajeshbalamohan/SPARK-14551.
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions