author    Matthew Farrellee <matt@redhat.com>    2014-09-19 14:31:50 -0700
committer Josh Rosen <joshrosen@apache.org>      2014-09-19 14:31:50 -0700
commit    be0c7563ea001a59469dbba219d2a8ef5785afa3 (patch)
tree      5a513bc9a6b7a7f04d72c4b22882db91bb3e9196 /docs/programming-guide.md
parent    a48956f5825d2255736eee50de79fba79bcb7e39 (diff)
[SPARK-1701] Clarify slice vs partition in the programming guide
This is a partial solution to SPARK-1701, addressing only the documentation confusion. Additional work would be to change the numSlices parameter name across languages, with care required in Scala and Python to maintain backward compatibility for named parameters.

Author: Matthew Farrellee <matt@redhat.com>

Closes #2305 from mattf/SPARK-1701 and squashes the following commits:

c0af05d [Matthew Farrellee] Further tweak
06f80fc [Matthew Farrellee] Wording tweak from Josh Rosen's review
7b045e0 [Matthew Farrellee] [SPARK-1701] Clarify slice vs partition in the programming guide
Diffstat (limited to 'docs/programming-guide.md')
-rw-r--r--  docs/programming-guide.md | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 624cc744df..01d378af57 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -286,7 +286,7 @@ We describe operations on distributed datasets later on.
</div>
-One important parameter for parallel collections is the number of *slices* to cut the dataset into. Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`).
+One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the dataset. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
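For illustration, a minimal Scala sketch of the `parallelize` call mentioned above (it assumes an existing `SparkContext` named `sc`, as elsewhere in this guide):

```scala
// Distribute a small local collection across 10 partitions rather than
// letting Spark pick a default based on the cluster; Spark schedules
// one task per partition.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, 10)

// partitions.size reports how many partitions were actually created.
println(distData.partitions.size)  // 10
```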
## External Datasets
@@ -311,7 +311,7 @@ Some notes on reading files with Spark:
* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
-* The `textFile` method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
+* The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
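For illustration, a short Scala sketch of the optional second argument to `textFile` described in the note above (the file path is only an example):

```scala
// Default: roughly one partition per HDFS block of the input file.
val lines = sc.textFile("/my/directory/data.txt")

// Request at least 100 partitions; you can ask for more partitions than
// blocks, but you cannot end up with fewer partitions than blocks.
val finerLines = sc.textFile("/my/directory/data.txt", 100)

println(finerLines.partitions.size)
```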
Apart from text files, Spark's Scala API also supports several other data formats:
@@ -343,7 +343,7 @@ Some notes on reading files with Spark:
* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
-* The `textFile` method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
+* The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark's Java API also supports several other data formats:
@@ -375,7 +375,7 @@ Some notes on reading files with Spark:
* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
-* The `textFile` method also takes an optional second argument for controlling the number of slices of the file. By default, Spark creates one slice for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of slices by passing a larger value. Note that you cannot have fewer slices than blocks.
+* The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark's Python API also supports several other data formats: