author: Yuhao Yang <hhbyyh@gmail.com> 2015-03-10 10:51:44 +0000
committer: Sean Owen <sowen@cloudera.com> 2015-03-10 10:52:21 +0000
commit: 9a0272fbb322042788f14e9cd99e2db86b456225
tree: 107e0b9d88ae45ba2efaaba2f77941f9e681a982 /examples
parent: 8767565cef01d847f57b7293d8b63b2422009b90
[SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce
JIRA: https://issues.apache.org/jira/browse/SPARK-6177

Add a comment to the LDA example suggesting coalesce, to avoid the very large number of partitions that `sc.textFile` can produce. `sc.textFile` creates an RDD with at least one partition per input file, and a very large number of small partitions degrades LDA performance.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:

a499630 [Yuhao Yang] update comment
9a2d7b6 [Yuhao Yang] move to comment
f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
26a564a [Yuhao Yang] add coalesce to LDAExample
Diffstat (limited to 'examples')
-rw-r--r-- examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala | 4
1 files changed, 3 insertions, 1 deletions
diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
index 11399a7633..08a93595a2 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
@@ -173,7 +173,9 @@ object LDAExample {
stopwordFile: String): (RDD[(Long, Vector)], Array[String], Long) = {
// Get dataset of document texts
- // One document per line in each text file.
+ // One document per line in each text file. If the input consists of many small files,
+ // this can result in a large number of small partitions, which can degrade performance.
+ // In this case, consider using coalesce() to create fewer, larger partitions.
val textRDD: RDD[String] = sc.textFile(paths.mkString(","))
// Split text into words
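
The comment added in this commit only recommends coalesce() and does not change the example's behavior. A minimal sketch of what following that advice might look like is given below. The object `CoalesceSketch`, the helper `loadText`, and the `maxPartitions` parameter are hypothetical names introduced for illustration; they are not part of LDAExample, and a sensible cap depends on the cluster and the data.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object; not part of the Spark examples.
object CoalesceSketch {

  // Reads one document per line from the given paths and, if the input
  // produced more partitions than maxPartitions (an assumed, user-chosen cap),
  // merges them into fewer, larger partitions.
  def loadText(sc: SparkContext, paths: Seq[String], maxPartitions: Int): RDD[String] = {
    val textRDD: RDD[String] = sc.textFile(paths.mkString(","))
    if (textRDD.partitions.length > maxPartitions) {
      // coalesce defaults to shuffle = false, so existing partitions are
      // combined without a full shuffle of the data.
      textRDD.coalesce(maxPartitions)
    } else {
      textRDD
    }
  }
}
```

Checking the partition count first leaves inputs that are already reasonably partitioned untouched; calling coalesce unconditionally would also work, since coalesce without a shuffle cannot increase the number of partitions.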