From 9a0272fbb322042788f14e9cd99e2db86b456225 Mon Sep 17 00:00:00 2001
From: Yuhao Yang <hhbyyh@gmail.com>
Date: Tue, 10 Mar 2015 10:51:44 +0000
Subject: [SPARK-6177][MLlib] Add note in LDA example suggesting coalesce

JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add a comment to the LDA example suggesting coalesce() to avoid the potentially very large number of partitions produced by `sc.textFile`.

`sc.textFile` creates an RDD with at least one partition per input file, so an input made up of many small files yields many small partitions, which degrades LDA performance.
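
For illustration only, the suggestion in the new comment could be applied roughly as follows. This is a minimal sketch, not part of the patch: the object name `CoalesceSketch`, the helper `compact`, the `maxPartitions` value of 8, and the default input path are all hypothetical. It relies on `coalesce` without a shuffle only ever reducing the partition count.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: names and values below are assumptions, not part of LDAExample.
object CoalesceSketch {

  // Merge partitions down to at most maxPartitions without a shuffle.
  // If the RDD already has fewer partitions, coalesce leaves the count unchanged.
  def compact(rdd: RDD[String], maxPartitions: Int): RDD[String] =
    rdd.coalesce(maxPartitions)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoalesceSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed input location; pass real paths as arguments instead.
      val paths: Seq[String] = if (args.nonEmpty) args.toSeq else Seq("data/docs/*.txt")

      // Same read as in LDAExample: one document per line in each text file.
      val textRDD: RDD[String] = sc.textFile(paths.mkString(","))

      // Assumed target; a sensible value depends on data size and cluster cores.
      val compacted = compact(textRDD, maxPartitions = 8)

      println(s"partitions before: ${textRDD.partitions.length}, " +
        s"after: ${compacted.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

Leaving `shuffle` at its default of `false` keeps the operation cheap because existing partitions are merged rather than redistributed. If the input is heavily skewed, `repartition` (which shuffles) may be the better trade-off.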

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:

a499630 [Yuhao Yang] update comment
9a2d7b6 [Yuhao Yang] move to comment
f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
26a564a [Yuhao Yang] add coalesce to LDAExample
---
 .../src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala   | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
index 11399a7633..08a93595a2 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
@@ -173,7 +173,9 @@ object LDAExample {
       stopwordFile: String): (RDD[(Long, Vector)], Array[String], Long) = {
 
     // Get dataset of document texts
-    // One document per line in each text file.
+    // One document per line in each text file. If the input consists of many small files,
+    // this can result in a large number of small partitions, which can degrade performance.
+    // In this case, consider using coalesce() to create fewer, larger partitions.
     val textRDD: RDD[String] = sc.textFile(paths.mkString(","))
 
     // Split text into words
-- 
cgit v1.2.3