path: root/sql/hive/src
author	Josh Rosen <joshrosen@databricks.com>	2016-09-27 17:52:57 -0700
committer	Reynold Xin <rxin@databricks.com>	2016-09-27 17:52:57 -0700
commit	b03b4adf6d8f4c6d92575c0947540cb474bf7de1 (patch)
tree	78f32a0c01b2471c8fb670fd9b997200c5a162f1 /sql/hive/src
parent	e7bce9e1876de6ee975ccc89351db58119674aef (diff)
[SPARK-17666] Ensure that RecordReaders are closed by data source file scans
## What changes were proposed in this pull request?

This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully consumed.

This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.

## How was this patch tested?

Tested manually for now.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15245 from JoshRosen/SPARK-17666-close-recordreader.
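For context, the fix follows a simple pattern: wrap the Hadoop `RecordReader` in an iterator that is also `Closeable`, then register a task-completion callback so the reader is closed even when the iterator is abandoned early (e.g. under a `LIMIT` or after a task failure). The sketch below is illustrative rather than the actual Spark source; the class name `CloseableRecordReaderIterator` is hypothetical, but the `TaskContext.addTaskCompletionListener` registration mirrors the line added in this diff.

```scala
import java.io.Closeable

import org.apache.hadoop.mapreduce.RecordReader
import org.apache.spark.TaskContext

// Hypothetical, simplified stand-in for RecordReaderIterator: an Iterator that
// owns a Hadoop RecordReader and can release it via close().
class CloseableRecordReaderIterator[T](private var reader: RecordReader[_, T])
  extends Iterator[T] with Closeable {

  private var havePair = false
  private var finished = false

  override def hasNext: Boolean = {
    if (!finished && !havePair) {
      finished = !reader.nextKeyValue()
      if (finished) {
        close()  // release the file handle as soon as the input is exhausted
      }
      havePair = !finished
    }
    !finished
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    havePair = false
    reader.getCurrentValue
  }

  override def close(): Unit = {
    if (reader != null) {
      try reader.close() finally reader = null  // null out so close() is idempotent
    }
  }
}
```

With such an iterator in hand, a `FileFormat.buildReader()` implementation only needs the one-liner added in this patch, `Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => recordsIterator.close()))`, to guarantee the reader is closed when the task finishes, whether or not the iterator was drained; `TaskContext.get()` can return `null` outside of a running task, hence the `Option` wrapper.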
Diffstat (limited to 'sql/hive/src')
-rw-r--r--	sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala	6	+++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
index 03b508e11a..15b72d8d21 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
@@ -31,6 +31,7 @@ import org.apache.hadoop.mapred.{JobConf, OutputFormat => MapRedOutputFormat, Re
import org.apache.hadoop.mapreduce._
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
+import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
@@ -146,12 +147,15 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
new SparkOrcNewRecordReader(orcReader, conf, fileSplit.getStart, fileSplit.getLength)
}
+ val recordsIterator = new RecordReaderIterator[OrcStruct](orcRecordReader)
+ Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => recordsIterator.close()))
+
// Unwraps `OrcStruct`s to `UnsafeRow`s
OrcRelation.unwrapOrcStructs(
conf,
requiredSchema,
Some(orcRecordReader.getObjectInspector.asInstanceOf[StructObjectInspector]),
- new RecordReaderIterator[OrcStruct](orcRecordReader))
+ recordsIterator)
}
}
}