[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation - spark


author	Rajesh Balamohan <rbalamohan@apache.org>	2016-04-22 22:51:40 -0700
committer	Reynold Xin <rxin@databricks.com>	2016-04-22 22:51:40 -0700
commit	e5226e3007d6645c6d48d3c1b2762566184f3fc7 (patch)
tree	3fc4e6ea679ee8fb87d7d1aa97da2462199ecc4b /mllib
parent	95faa731c15ce2e36373071a405207165818df97 (diff)

[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation

## What changes were proposed in this pull request?

When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, in OrcRelation.unwrapOrcStructs, it ends up reading the file information to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional NN call (specifically for partitioned datasets).

Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader, with an option of exposing the ObjectInspector. This eliminates the need to look up the file later to generate the object inspector. This would be especially useful for partitioned tables/datasets.

## How was this patch tested?

Ran tpc-ds queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite …SourceStrategy mode

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #12319 from rajeshbalamohan/SPARK-14551.
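The idea described above can be illustrated with a small, self-contained sketch. This is not the code from the patch and it does not use the Hadoop or Hive APIs. `FakeNameNode`, `InspectorExposingReader`, and both helper methods are hypothetical names invented for illustration, and the "inspector" is modeled as a plain string. The sketch only counts metadata lookups to show why keeping the inspector on the reader saves one call per file.

```java
import java.util.HashMap;
import java.util.Map;

public class NameNodeCallSketch {

    /** Hypothetical stand-in for the NameNode: counts metadata lookups. */
    static final class FakeNameNode {
        private final Map<String, String> schemas = new HashMap<>();
        private int calls = 0;

        void addFile(String path, String schema) {
            schemas.put(path, schema);
        }

        /** Each call models one round trip to the NameNode. */
        String openFileMetadata(String path) {
            calls++;
            return schemas.get(path);
        }

        int calls() {
            return calls;
        }
    }

    /** Hypothetical reader that opens the file once and keeps the inspector it derived. */
    static final class InspectorExposingReader {
        private final String inspector;

        InspectorExposingReader(FakeNameNode nn, String path) {
            // Creating the reader already costs one lookup.
            this.inspector = "inspector<" + nn.openFileMetadata(path) + ">";
        }

        String getObjectInspector() {
            return inspector;
        }
    }

    /** Old flow (as modeled here): create the reader, then look the file up again. */
    static int callsWithSeparateLookup(FakeNameNode nn, String path) {
        int before = nn.calls();
        new InspectorExposingReader(nn, path);
        String inspector = "inspector<" + nn.openFileMetadata(path) + ">"; // second lookup
        return nn.calls() - before;
    }

    /** New flow (as modeled here): reuse the inspector the reader already holds. */
    static int callsWithExposedInspector(FakeNameNode nn, String path) {
        int before = nn.calls();
        InspectorExposingReader reader = new InspectorExposingReader(nn, path);
        String inspector = reader.getObjectInspector(); // no extra lookup
        return nn.calls() - before;
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        String path = "/table/p=1/part-0.orc"; // hypothetical partition file
        nn.addFile(path, "struct<a:int>");
        System.out.println("separate lookup: " + callsWithSeparateLookup(nn, path));
        System.out.println("exposed inspector: " + callsWithExposedInspector(nn, path));
    }
}
```

In this model the saving is one lookup per file, which is why the effect grows with the number of files in a partitioned dataset.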

Diffstat (limited to 'mllib')

0 files changed, 0 insertions, 0 deletions

