 docs/streaming-programming-guide.md | 18 +++++++++++---------
 1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 939599aa68..0c125eb693 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -522,9 +522,9 @@ common ones are as follows.
   <td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
   <td> When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function. <b>Note:</b> By default,
-  this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
-  do the grouping. You can pass an optional <code>numTasks</code> argument to set a different
-  number of tasks.</td>
+  this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number
+  is determined by the config property <code>spark.default.parallelism</code>) to do the grouping.
+  You can pass an optional <code>numTasks</code> argument to set a different number of tasks.</td>
 </tr>
 <tr>
   <td> <b>join</b>(<i>otherStream</i>, [<i>numTasks</i>]) </td>
@@ -743,8 +743,9 @@ said two parameters - <i>windowLength</i> and <i>slideInterval</i>.
   <td> When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function <i>func</i> over batches in
   a sliding window. <b>Note:</b> By default, this uses Spark's default number of
-  parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
-  <code>numTasks</code> argument to set a different number of tasks.
+  parallel tasks (2 for local mode, and in cluster mode the number is determined by the config
+  property <code>spark.default.parallelism</code>) to do the grouping. You can pass an optional
+  <code>numTasks</code> argument to set a different number of tasks.
 </td>
 </tr>
 <tr>
@@ -956,9 +957,10 @@ before further processing.
 ### Level of Parallelism in Data Processing
 Cluster resources maybe under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
-parallelism as an argument (see the
-[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the [config property]
+(configuration.html#spark-properties) `spark.default.parallelism`. You can pass the level of
+parallelism as an argument (see [`PairDStreamFunctions`]
+(api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
 documentation), or set the [config property](configuration.html#spark-properties)
 `spark.default.parallelism` to change the default.
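For context on the behavior this patch documents: `spark.default.parallelism` sets the shuffle parallelism cluster-wide, while the optional `numTasks` argument overrides it for a single operation. A minimal sketch of the cluster-wide setting (the value `24` is purely illustrative, not taken from the patch):

```properties
# conf/spark-defaults.conf -- illustrative value; a common rule of thumb
# is to aim for roughly 2-3 tasks per CPU core available in the cluster
spark.default.parallelism    24
```

Alternatively, a single shuffle can override the default in place, e.g. `pairs.reduceByKey(_ + _, 24)` in Scala, which is the `numTasks` argument the patched tables describe.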