author: fidato <fidato.july13@gmail.com> 2016-11-07 18:41:17 -0800
committer: Reynold Xin <rxin@databricks.com> 2016-11-07 18:41:17 -0800
commit: 6f3697136aa68dc39d3ce42f43a7af554d2a3bf9 (patch)
tree: 22b39fcfa5d7fa864a4921db67814093aa4c3c55 /docs
parent: 1da64e1fa0970277d1fb47dec8adca47b068b1ec (diff)
[SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles
## What changes were proposed in this pull request?

This pull request contains the changes for the critical bug SPARK-16575. It rectifies the partition calculation in BinaryFileRDD: an RDD created with sc.binaryFiles always consisted of just two partitions, regardless of the number of input files.

## How was this patch tested?

The original issue, i.e. getNumPartitions on a binary files RDD always returning two partitions, was first reproduced and then re-tested with the changes applied. The existing unit tests were also run and passed.

This contribution is my original work and I license the work to the project under the project's open source license.

srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look.

Author: fidato <fidato.july13@gmail.com>

Closes #15327 from fidato13/SPARK-16575.
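For context, here is a minimal, hypothetical sketch (not part of this patch) of how `sc.binaryFiles` interacts with the two properties documented below; the input path and the local master are illustrative assumptions, and the default values are taken from the documentation added in this diff:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BinaryFilesPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("binary-files-partitions")
      .setMaster("local[*]") // hypothetical local run
      // Cap each partition at 128 MB of file data (the documented default).
      .set("spark.files.maxPartitionBytes", "134217728")
      // Treat each file open as costing roughly 4 MB of scan time (the
      // documented default), so many small files are spread across enough
      // partitions instead of being packed into too few.
      .set("spark.files.openCostInBytes", "4194304")

    val sc = new SparkContext(conf)
    // Before this fix, the RDD returned by sc.binaryFiles always had just
    // two partitions, regardless of how many files matched the path.
    val rdd = sc.binaryFiles("/data/blobs/*.bin") // hypothetical path
    println(s"partitions = ${rdd.getNumPartitions}")
    sc.stop()
  }
}
```

The same two properties can also be set at submit time, e.g. `spark-submit --conf spark.files.maxPartitionBytes=134217728 ...`.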
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md  16
1 file changed, 16 insertions, 0 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 0017219e07..d0acd944dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1035,6 +1035,22 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.files.maxPartitionBytes</code></td>
+ <td>134217728 (128 MB)</td>
+ <td>
+ The maximum number of bytes to pack into a single partition when reading files.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.files.openCostInBytes</code></td>
+ <td>4194304 (4 MB)</td>
+ <td>
+ The estimated cost to open a file, measured by the number of bytes that could be scanned in the
+ same time. This is used when putting multiple files into a partition. It is better to
+ over-estimate; then the partitions with small files will be faster than partitions with bigger files.
+ </td>
+</tr>
+<tr>
<td><code>spark.hadoop.cloneConf</code></td>
<td>false</td>
<td>If set to true, clones a new Hadoop <code>Configuration</code> object for each task. This