author: fidato <fidato.july13@gmail.com> 2016-11-07 18:41:17 -0800
committer: Reynold Xin <rxin@databricks.com> 2016-11-07 18:41:17 -0800
commit: 6f3697136aa68dc39d3ce42f43a7af554d2a3bf9 (patch)
tree: 22b39fcfa5d7fa864a4921db67814093aa4c3c55 /docs
parent: 1da64e1fa0970277d1fb47dec8adca47b068b1ec (diff)
[SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles
## What changes were proposed in this pull request?

This pull request contains the changes for the critical bug SPARK-16575. It rectifies the partition calculation in BinaryFileRDD: an RDD created with sc.binaryFiles always consisted of just two partitions, regardless of the number of input files.

## How was this patch tested?

The original issue, i.e. getNumPartitions on a binary files RDD always returning two partitions, was first reproduced and then re-tested with the changes applied. The existing unit tests were also run and passed.

This contribution is my original work and I license the work to the project under the project's open source license.

srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look.

Author: fidato <fidato.july13@gmail.com>

Closes #15327 from fidato13/SPARK-16575.
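For context, here is a minimal, hypothetical sketch (not part of this patch) of how `sc.binaryFiles` interacts with the two properties documented below; the input path and the local master are illustrative assumptions, and the default values are taken from the documentation added in this diff:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BinaryFilesPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("binary-files-partitions")
      .setMaster("local[*]") // hypothetical local run
      // Cap each partition at 128 MB of file data (the documented default).
      .set("spark.files.maxPartitionBytes", "134217728")
      // Treat each file open as costing roughly 4 MB of scan time (the
      // documented default), so many small files are spread across enough
      // partitions instead of being packed into too few.
      .set("spark.files.openCostInBytes", "4194304")

    val sc = new SparkContext(conf)
    // Before this fix, the RDD returned by sc.binaryFiles always had just
    // two partitions, regardless of how many files matched the path.
    val rdd = sc.binaryFiles("/data/blobs/*.bin") // hypothetical path
    println(s"partitions = ${rdd.getNumPartitions}")
    sc.stop()
  }
}
```

The same two properties can also be set at submit time, e.g. `spark-submit --conf spark.files.maxPartitionBytes=134217728 ...`.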
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md  16
1 file changed, 16 insertions, 0 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 0017219e07..d0acd944dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1035,6 +1035,22 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.files.maxPartitionBytes</code></td>
+ <td>134217728 (128 MB)</td>
+ <td>
+ The maximum number of bytes to pack into a single partition when reading files.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.files.openCostInBytes</code></td>
+ <td>4194304 (4 MB)</td>
+ <td>
+ The estimated cost to open a file, measured by the number of bytes that could be scanned in the
+ same time. This is used when putting multiple files into a partition. It is better to
+ over-estimate; then the partitions with small files will be faster than partitions with bigger files.
+ </td>
+</tr>
+<tr>
<td><code>spark.hadoop.cloneConf</code></td>
<td>false</td>
<td>If set to true, clones a new Hadoop <code>Configuration</code> object for each task. This