author Rajesh Balamohan <rbalamohan@apache.org> 2016-04-22 22:51:40 -0700
committer Reynold Xin <rxin@databricks.com> 2016-04-22 22:51:40 -0700
commit e5226e3007d6645c6d48d3c1b2762566184f3fc7
tree 3fc4e6ea679ee8fb87d7d1aa97da2462199ecc4b /mllib
parent 95faa731c15ce2e36373071a405207165818df97
[SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation
## What changes were proposed in this pull request?

When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, OrcRelation.unwrapOrcStructs ends up reading the file information again to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional call, specifically for partitioned datasets.

This patch adds OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but has an option to expose the ObjectInspector. This eliminates the need to look up the file later to generate the object inspector, which is especially useful for partitioned tables/datasets.

## How was this patch tested?

Ran tpc-ds queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite

…SourceStrategy mode

Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #12319 from rajeshbalamohan/SPARK-14551.
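
The idea in the commit message above (keep the inspector that the reader already obtained when it opened the file, rather than looking the file up a second time) can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in and not the patch's code: `CountingFileSystem` only simulates NameNode calls with a counter, `InspectorExposingReader` is an invented class, and the "inspector" is reduced to a plain schema string.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only. None of these classes are Spark, Hive, or Hadoop APIs.
public class InspectorReuseSketch {

    /** Hypothetical file system that counts metadata lookups (stand-ins for NameNode calls). */
    static final class CountingFileSystem {
        private final Map<String, String> schemas = new HashMap<>();
        int metadataCalls = 0;

        void put(String path, String schema) {
            schemas.put(path, schema);
        }

        /** Each call models one round trip to the NameNode. */
        String readSchema(String path) {
            metadataCalls++;
            return schemas.get(path);
        }
    }

    /** Hypothetical reader that keeps the schema it read on open and exposes it to callers. */
    static final class InspectorExposingReader {
        private final String inspector;

        InspectorExposingReader(CountingFileSystem fs, String path) {
            // Opening the reader costs one metadata call either way.
            this.inspector = fs.readSchema(path);
        }

        String getObjectInspector() {
            return inspector;
        }
    }

    /** Models the earlier flow: open the reader, then look the file up again for the inspector. */
    static int callsWithSecondLookup(CountingFileSystem fs, String path) {
        int before = fs.metadataCalls;
        new InspectorExposingReader(fs, path); // open the reader
        fs.readSchema(path);                   // separate lookup just for the inspector
        return fs.metadataCalls - before;
    }

    /** Models the proposed flow: open the reader and reuse the inspector it already holds. */
    static int callsWithExposedInspector(CountingFileSystem fs, String path) {
        int before = fs.metadataCalls;
        InspectorExposingReader reader = new InspectorExposingReader(fs, path);
        reader.getObjectInspector();           // no further metadata call
        return fs.metadataCalls - before;
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        fs.put("/table/part=1/file.orc", "struct<a:int>");
        System.out.println("second lookup:     " + callsWithSecondLookup(fs, "/table/part=1/file.orc"));
        System.out.println("exposed inspector: " + callsWithExposedInspector(fs, "/table/part=1/file.orc"));
    }
}
```

In this toy model the saving is one simulated call per file opened, which is why the commit message singles out partitioned datasets, where many files are read.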
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions