* Revert "Preparing development version 1.2.2-SNAPSHOT"Patrick Wendell2015-01-2729-29/+29
| | | | This reverts commit f53a4319ba5f0843c077e64ae5a41e2fac835a5b.
* [MLlib] fix python example of ALS in guide (Davies Liu, 2015-01-27, 1 file changed, -6/+5)
    fix python example of ALS in guide, use Rating instead of np.array.
    Author: Davies Liu <davies@databricks.com>
    Closes #4226 from davies/fix_als_guide and squashes the following commits: 1433d76 [Davies Liu] fix python example of als in guide
    (cherry picked from commit fdaad4eb0388cfe43b5b6600927eb7b9182646f9)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
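For context, the guide change is about building the ratings RDD from Rating objects rather than raw arrays. The commit itself touches the Python guide; the following is only a rough Scala sketch of the same pattern, assuming a live SparkContext `sc`, with a placeholder input path and illustrative parameters (none of this is taken from the commit):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" lines into Rating objects (placeholder path).
val ratings = sc.textFile("data/als/test.data").map { line =>
  val Array(user, product, rate) = line.split(',')
  Rating(user.toInt, product.toInt, rate.toDouble)
}

// Train a model; rank, iterations, and lambda here are illustrative values only.
val model = ALS.train(ratings, 10, 10, 0.01)
println(model.predict(1, 2))  // predicted rating of product 2 by user 1
```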
* SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output (Sean Owen, 2015-01-27, 1 file changed, -2/+8)
    Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command line tools. An alternative is just to make Maven output these of course - would that be better? I ask in case there is a reason I'm missing, like, we need to hash files that Maven doesn't build.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4161 from srowen/SPARK-5308 and squashes the following commits: 70d09d0 [Sean Owen] Use $(...) syntax e25eff8 [Sean Owen] Generate MD5, SHA1 hashes in a format like Maven's plugin
    (cherry picked from commit ff356e2a21e31998cda3062e560a276a3bfaa7ab)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* Preparing development version 1.2.2-SNAPSHOT (Patrick Wendell, 2015-01-27, 29 files changed, -29/+29)
* Preparing Spark release v1.2.1-rc1 (Patrick Wendell, 2015-01-27, 29 files changed, -29/+29)
* Revert "Preparing Spark release v1.2.1-rc1"Patrick Wendell2015-01-2629-29/+29
| | | | This reverts commit e87eb2b42f137c22194cfbca2abf06fecdf943da.
* Revert "Preparing development version 1.2.2-SNAPSHOT"Patrick Wendell2015-01-2629-29/+29
| | | | This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a.
* Preparing development version 1.2.2-SNAPSHOT (Ubuntu, 2015-01-27, 29 files changed, -29/+29)
* Preparing Spark release v1.2.1-rc1 (Ubuntu, 2015-01-27, 29 files changed, -29/+29)
* Updating versions for Spark 1.2.1 (Patrick Wendell, 2015-01-26, 3 files changed, -4/+5)
* SPARK-4147 [CORE] Reduce log4j dependency (Sean Owen, 2015-01-26, 1 file changed, -9/+11)
    Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside. This is a trivial change, may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4190 from srowen/SPARK-4147 and squashes the following commits: 4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.
    (cherry picked from commit 54e7b456dd56c9e52132154e699abca87563465b)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-5355] use j.u.c.ConcurrentHashMap instead of TrieMap (Davies Liu, 2015-01-26, 3 files changed, -21/+23)
    j.u.c.ConcurrentHashMap is more battle tested.
    cc rxin JoshRosen pwendell
    Author: Davies Liu <davies@databricks.com>
    Closes #4208 from davies/safe-conf and squashes the following commits: c2182dc [Davies Liu] address comments, fix tests 3a1d821 [Davies Liu] fix test da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe
    (cherry picked from commit 142093179a4c40bdd90744191034de7b94a963ff)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
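For illustration only, a minimal sketch (not the actual SparkConf code) of the idea behind this change: back the settings map with java.util.concurrent.ConcurrentHashMap so that concurrent set/get/getAll calls never observe a half-updated structure. The class name here is hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical stand-in for SparkConf's internal settings storage.
class SafeSettings {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): Unit = settings.put(key, value)
  def get(key: String): Option[String] = Option(settings.get(key))

  // Snapshot of all entries; safe to call while other threads are writing.
  def getAll: Array[(String, String)] =
    settings.entrySet().asScala.map(e => (e.getKey, e.getValue)).toArray
}
```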
* SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files (Sean Owen, 2015-01-25, 1 file changed, -7/+2)
    Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4189 from srowen/SPARK-4430 and squashes the following commits: 9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure
    (cherry picked from commit 0528b85cf96f9c9c074b5fbb5b9c5dd8071c0bc7)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* Revert "[SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress ↵Andrew Or2015-01-251-3/+1
| | | | | | file was renamed to completed file" This reverts commit 8f55beeb51e6ea72e63af3f276497f61dd24d09b.
* [SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress file was renamed to completed file (Kousuke Saruta, 2015-01-25, 1 file changed, -1/+3)
    `FsHistoryProvider` tries to update the application status, but if `checkForLogs` is called before the `.inprogress` file is renamed to the completed file, the file is not recognized as completed.
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Closes #4132 from sarutak/SPARK-5344 and squashes the following commits: 9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344 d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider
    (cherry picked from commit 8f5c827b01026bf45fc774ed7387f11a941abea8)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Conflicts: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
* SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode (Sean Owen, 2015-01-25, 1 file changed, -2/+2)
    This is a trivial addendum to SPARK-4506, which was already resolved; it was noted by Asim Jalis in SPARK-4506.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4160 from srowen/SPARK-4506 and squashes the following commits: 5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode
    (cherry picked from commit 9f6435763d173d2abf82d16b5878983fa8bf3419)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* SPARK-5382: Use SPARK_CONF_DIR in spark-class and spark-submit, spark-submit2.cmd if it is defined (Jacek Lewandowski, 2015-01-25, 2 files changed, -2/+9)
    Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
    Closes #4177 from jacek-lewandowski/SPARK-5382-1.2 and squashes the following commits: 41cef25 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class and spark-submit, spark-submit2.cmd if it is defined
* SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined (Jacek Lewandowski, 2015-01-25, 1 file changed, -2/+3)
    Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
    Closes #4179 from jacek-lewandowski/SPARK-5382-1.3 and squashes the following commits: 55d7791 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined
* SPARK-3852 [DOCS] Document spark.driver.extra* configs (Sean Owen, 2015-01-25, 1 file changed, -0/+21)
    As per the JIRA. I copied the `spark.executor.extra*` text, but removed info that appears to be specific to the `executor` config and not `driver`.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4185 from srowen/SPARK-3852 and squashes the following commits: f60a8a1 [Sean Owen] Document spark.driver.extra* configs
    (cherry picked from commit c586b45dd25b50be7f195df2ce91b307e1ed71a9)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-5402] log executor ID at executor-construction time (Ryan Williams, 2015-01-25, 1 file changed, -5/+8)
    Also rename "slaveHostname" to "executorHostname".
    Author: Ryan Williams <ryan.blake.williams@gmail.com>
    Closes #4195 from ryan-williams/exec and squashes the following commits: e60a7bb [Ryan Williams] log executor ID at executor-construction time
    (cherry picked from commit aea25482c370fbcf712a464501605bc16ee4ed5d)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Conflicts: core/src/main/scala/org/apache/spark/executor/Executor.scala
* [SPARK-5401] set executor ID before creating MetricsSystem (Ryan Williams, 2015-01-25, 2 files changed, -2/+6)
    Author: Ryan Williams <ryan.blake.williams@gmail.com>
    Closes #4194 from ryan-williams/metrics and squashes the following commits: 7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem
* [SPARK-5058] Part 2. Typos and broken URL (Jongyoul Lee, 2015-01-23, 1 file changed, -1/+1)
    Also fixed the Java link.
    Author: Jongyoul Lee <jongyoul@gmail.com>
    Closes #4172 from jongyoul/SPARK-FIXDOC and squashes the following commits: 6be03e5 [Jongyoul Lee] [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link
    (cherry picked from commit 09e09c548e7722fca1cdc89bd37de2cee58f4ce9)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl (Takeshi Yamamuro, 2015-01-23, 2 files changed, -2/+22)
    If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:

    object GraphTest extends Logging {
      def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
        graph.aggregateMessages(
          ctx => {
            ctx.sendToSrc(1)
            ctx.sendToDst(2)
          },
          _ + _)
      }
    }

    val g = GraphLoader.edgeListFile(sc, "graph.txt")
    val rdd = GraphTest.run(g)

    java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
      at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
      at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
      at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
      at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
      at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
      at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
      ...

    Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
    Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits: 0cd8942 [Ankur Dave] Use more concise getOrElse aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions 0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl
    (cherry picked from commit e224dbb011789297cd6c6ba095f702c042869ed6)
    Signed-off-by: Ankur Dave <ankurdave@gmail.com>
* [SPARK-5063] More helpful error messages for several invalid operations (Josh Rosen, 2015-01-23, 6 files changed, -14/+138)
    This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.
    In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).
    In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).
    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits: a38774b [Josh Rosen] Fix spelling typo a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases. 2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility 3f0ea0c [Josh Rosen] Revert two unintentional formatting changes 8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2 8cff41a [Josh Rosen] IllegalStateException fix 6ef68d0 [Josh Rosen] Fix Python line length issues. 9f6a0b8 [Josh Rosen] Add improved error messages to PySpark. 13afd0f [Josh Rosen] SparkException -> IllegalStateException 8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063 b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD 99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts. 34833e8 [Josh Rosen] Add more descriptive error message. 57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD. 15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
    (cherry picked from commit cef1f092a628ac20709857b4388bb10e0b5143b0)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
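For illustration only, a minimal Scala sketch of the invalid patterns these clearer errors target, assuming a live SparkContext `sc`; both snippets are expected to fail and are not code from the patch:

```scala
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 10)

// Nested RDD / action inside a transformation: rdd2.count() would have to run on an
// executor, where no SparkContext is available, so this cannot work.
rdd1.map(x => x + rdd2.count().toInt).collect()

// Using a stopped SparkContext: with this patch the failure is an explicit error
// rather than a confusing NullPointerException.
sc.stop()
rdd1.count()
```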
* [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug (jerryshao, 2015-01-22, 4 files changed, -20/+32)
    Because of the lack of a `BlockAllocationEvent` in WAL recovery, the dangled event will mix into the new batch, which will lead to the wrong result. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233).
    Author: jerryshao <saisai.shao@intel.com>
    Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits: f0b0c0b [jerryshao] Further address the comments a237c75 [jerryshao] Address the comments e356258 [jerryshao] Fix bug in unit test 558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure
    (cherry picked from commit 3c3fa632e6ba45ce536065aa1145698385301fb2)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [HOTFIX] Fixed compilation error due to missing SparkContext._ implicit conversions. (Tathagata Das, 2015-01-22, 1 file changed, -0/+1)
* [SPARK-5147][Streaming] Delete the received data WAL log periodically (Tathagata Das, 2015-01-21, 9 files changed, -50/+172)
    This is a refactored fix based on jerryshao's PR #4037. It enables deletion of old WAL files containing the received block data. Improvements over #4037:
    - Respecting the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with different remember durations, the deletion would have been based on the shortest remember duration, thus deleting data prematurely for the receiver stream with the longer remember duration.
    - Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration.
    jerryshao I am going to merge this ASAP to make it 1.2.1. Thanks for the initial draft of this PR. Made my job much easier.
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: jerryshao <saisai.shao@intel.com>
    Closes #4149 from tdas/SPARK-5147 and squashes the following commits: 730798b [Tathagata Das] Added comments. c4cf067 [Tathagata Das] Minor fixes 2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams 2736fd1 [jerryshao] Delete the old WAL log periodically
    (cherry picked from commit 3027f06b4127ab23a43c5ce8cebf721e3b6766e5)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-5355] make SparkConf thread-safe (Davies Liu, 2015-01-21, 1 file changed, -2/+3)
    The SparkConf is not thread-safe, but is accessed by many threads. getAll() could return only part of the configs if another thread is modifying it concurrently. This PR changes SparkConf.settings to a thread-safe TrieMap.
    Author: Davies Liu <davies@databricks.com>
    Closes #4143 from davies/safe-conf and squashes the following commits: f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe
    (cherry picked from commit 9bad062268676aaa66dcbddd1e0ab7f2d7742425)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* Make sure only owner can read / write to directories created for the job. (Marcelo Vanzin, 2015-01-21, 6 files changed, -54/+68)
    Whenever a directory is created by the utility method, immediately restrict its permissions so that only the owner has access to its contents.
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-5006][Deploy] spark.port.maxRetries doesn't work (WangTaoTheTonic, 2015-01-21, 14 files changed, -32/+34)
    https://issues.apache.org/jira/browse/SPARK-5006
    I think the issue was introduced in https://github.com/apache/spark/pull/1777. I have not dug into Mesos's backend yet; maybe the same logic should be added there as well.
    Author: WangTaoTheTonic <barneystinson@aliyun.com>
    Author: WangTao <barneystinson@aliyun.com>
    Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits: 8cdf96d [WangTao] indent thing 2d86d65 [WangTaoTheTonic] fix line length 7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor 61a370d [WangTaoTheTonic] some minor fixes bc6e1ec [WangTaoTheTonic] rebase 67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg 29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable 396c226 [WangTaoTheTonic] make the grammar more like scala 191face [WangTaoTheTonic] invalid value name 62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work
    Conflicts: external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala
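For illustration, a simplified sketch of the retry-on-bind idea, reading the retry count from a SparkConf as the squashed commits describe. The signature and the default of 16 are assumptions, not the actual Utils.startServiceOnPort:

```scala
import java.net.BindException
import org.apache.spark.SparkConf

// Try startPort, startPort + 1, ... until one binds, up to spark.port.maxRetries extra attempts.
def startServiceOnPort[T](startPort: Int, startService: Int => T, conf: SparkConf): T = {
  val maxRetries = conf.getInt("spark.port.maxRetries", 16)  // default value is an assumption
  var result: Option[T] = None
  var offset = 0
  while (result.isEmpty && offset <= maxRetries) {
    try {
      result = Some(startService(startPort + offset))
    } catch {
      case _: BindException => offset += 1  // port busy, try the next one
    }
  }
  result.getOrElse(
    throw new BindException(s"Could not bind a port in range $startPort-${startPort + maxRetries}"))
}
```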
* [SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop (Kenji Kikushima, 2015-01-21, 2 files changed, -0/+16)
    I looked into GraphGenerators#chooseCell, and found that chooseCell can't generate more edges than pow(2, (2 * (log2(numVertices) - 1))) to make a power-law graph. (Ex. numVertices: 4, upper bound: 4; numVertices: 8, upper bound: 16; numVertices: 16, upper bound: 64.) If we request more edges than the upper bound, rmatGraph falls into an infinite loop. So, how about adding an argument validation?
    Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp>
    Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits: 4ee18c7 [Ankur Dave] Reword error message and add unit test d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop.
    (cherry picked from commit 3ee3ab592eee831d759c940eb68231817ad6d083)
    Signed-off-by: Ankur Dave <ankurdave@gmail.com>
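As a hedged sketch of the validation idea (not the actual GraphGenerators code), the bound quoted above can be computed up front and checked before any edges are generated:

```scala
// Upper bound from the description above: pow(2, 2 * (log2(numVertices) - 1)).
def validateNumEdges(numVertices: Int, numEdges: Long): Unit = {
  val log2Vertices = math.ceil(math.log(numVertices) / math.log(2)).toInt
  val upperBound = math.pow(2, 2 * (log2Vertices - 1)).toLong
  require(numEdges <= upperBound,
    s"numEdges must be <= $upperBound for $numVertices vertices, or generation never terminates")
}

validateNumEdges(16, 64)   // ok: the bound for 16 vertices is 64
// validateNumEdges(16, 65) would throw IllegalArgumentException instead of looping forever
```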
* [SPARK-4161] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf (GuoQiang Li, 2015-01-21, 1 file changed, -0/+7)
    Author: GuoQiang Li <witgo@qq.com>
    Closes #3050 from witgo/SPARK-4161 and squashes the following commits: abb6fa4 [GuoQiang Li] move usejavacp opt to spark-shell 89e39e7 [GuoQiang Li] review commit c2a6f04 [GuoQiang Li] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf
* [SPARK-4569] Rename 'externalSorting' in Aggregator (Ilya Ganelin, 2015-01-21, 1 file changed, -4/+6)
    Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening.
    Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
    Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits: 1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator 5b3f39c [Ilya Ganelin] [SPARK-4569] Rename in Aggregator
* [SPARK-4759] Fix driver hanging from coalescing partitions (Andrew Or, 2015-01-21, 2 files changed, -16/+22)
    The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction. This is because our use of empty string as default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (empty string). The intended semantics here, however, is that the partition does not have a preferred location, and the TSM should schedule the corresponding task accordingly.
    Author: Andrew Or <andrew@databricks.com>
    Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits: e520d6b [Andrew Or] Oops 3ebf8bd [Andrew Or] A few comments f370a4e [Andrew Or] Fix tests 2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location
    (cherry picked from commit 4f93d0cabe5d1fc7c0fd0a33d992fd85df1fecb4)
    Signed-off-by: Andrew Or <andrew@databricks.com>
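A hedged illustration of the idea behind the fix (not the actual CoalescedRDDPartition code): model "no preferred location" as None rather than an empty string, so the scheduler never sees host "":

```scala
// Hypothetical simplified partition type for illustration only.
case class CoalescedPartitionSketch(index: Int, preferredLocation: Option[String] = None)

// An empty Seq means "no preference"; nothing ever tries to place a task on host "".
def preferredLocations(p: CoalescedPartitionSketch): Seq[String] =
  p.preferredLocation.toSeq
```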
* [HOTFIX] Update pom.xml to pull MapR's Hadoop version 2.4.1. (Kannan Rajah, 2015-01-20, 1 file changed, -3/+3)
    Author: Kannan Rajah <rkannan82@gmail.com>
    Closes #4108 from rkannan82/master and squashes the following commits: eca095b [Kannan Rajah] Update pom.xml to pull MapR's Hadoop version 2.4.1.
    (cherry picked from commit ec5b0f2cef4b30047c7f88bdc00d10b6aa308124)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-5275] [Streaming] include python source code (Davies Liu, 2015-01-20, 1 file changed, -0/+8)
    Include the python source code into assembly jar.
    cc mengxr pwendell
    Author: Davies Liu <davies@databricks.com>
    Closes #4128 from davies/build_streaming2 and squashes the following commits: 546af4c [Davies Liu] fix indent 48859b2 [Davies Liu] include python source code
    (cherry picked from commit bad6c5721167153d7ed834b49f87bf2980c6ed67)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-4959][SQL] Attributes are case sensitive when using a select query from a projection (backport to Spark 1.2) (Cheng Hao, 2015-01-20, 2 files changed, -6/+17)
    This is a follow up of #3796, which cannot be merged back to Spark 1.2. Manually merge it.
    Author: Cheng Hao <hao.cheng@intel.com>
    Closes #4013 from chenghao-intel/spark_4959_backport and squashes the following commits: 1f6c93d [Cheng Hao] backport to Spark-1.2
* SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski) (Jacek Lewandowski, 2015-01-20, 1 file changed, -1/+1)
    Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
    Closes #4113 from jacek-lewandowski/SPARK-4660-master and squashes the following commits: a5e84ca [Jacek Lewandowski] SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski)
    (cherry picked from commit c93a57f0d6dc32b127aa68dbe4092ab0b22a9667)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-4803] [streaming] Remove duplicate RegisterReceiver message (Ilayaperumal Gopinathan, 2015-01-20, 2 files changed, -9/+2)
    The ReceiverTracker receives `RegisterReceiver` messages two times:
    1) When the actor at `ReceiverSupervisorImpl`'s preStart is invoked
    2) After the receiver is started at the executor, from `onReceiverStart()` at `ReceiverSupervisorImpl`
    Though the RegisterReceiver message uses the same streamId and the receiverInfo gets updated every time the message is processed at the `ReceiverTracker`, it makes sense to register the receiver only after the receiver is started.
    Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io>
    Closes #3648 from ilayaperumalg/RTActor-remove-prestart and squashes the following commits: 868efab [Ilayaperumal Gopinathan] Increase receiverInfo collector timeout to 2 secs 3118e5e [Ilayaperumal Gopinathan] Fix StreamingListenerSuite's startedReceiverStreamIds size 634abde [Ilayaperumal Gopinathan] Remove duplicate RegisterReceiver message
    (cherry picked from commit 4afad9c7702239f6d5b1b49dc48ee08580964e17)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-4504][Examples] fix run-example failure if multiple assembly jars exist (Venkata Ramana Gollamudi, 2015-01-19, 2 files changed, -18/+36)
    Fix the run-example script to fail fast with a useful error message if multiple example assembly JARs are present.
    Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
    Closes #3377 from gvramana/run-example_fails and squashes the following commits: fa7f481 [Venkata Ramana Gollamudi] Fixed review comments, avoiding ls output scanning. 6aa1ab7 [Venkata Ramana Gollamudi] Fix run-examples script error during multiple jars
    (cherry picked from commit 74de94ea6db96a04b278c6106264313504d7b8f3)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    Conflicts: bin/compute-classpath.sh
* [SPARK-5282][mllib] RowMatrix easily gets int overflow in the memory size warning (Yuhao Yang, 2015-01-19, 1 file changed, -2/+2)
    JIRA: https://issues.apache.org/jira/browse/SPARK-5282
    Fix the possible int overflow in the memory computation warning.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #4069 from hhbyyh/addscStop and squashes the following commits: e54e5c8 [Yuhao Yang] change to MB based number 7afac23 [Yuhao Yang] 5282: fix int overflow in the warning
    (cherry picked from commit 4432568aac1d4a44fa1a7c3469f095eb7a6ce945)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
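For illustration, a hedged sketch of the kind of overflow being warned about (the exact expression in RowMatrix may differ): sizing an n x n matrix in Int arithmetic wraps for large n, while Long arithmetic reported as an MB-based number does not:

```scala
val n = 70000
// Int arithmetic: 70000 * 70000 exceeds Int.MaxValue and silently wraps.
val wrappedBytes: Int = n * n * 8
// Long arithmetic, reported in megabytes, stays correct (~37,000 MB here).
val memoryMB: Long = n.toLong * n * 8L / (1024L * 1024L)
println(s"$n columns will need about $memoryMB MB of memory")
```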
* [SPARK-5289]: Backport publishing of repl, yarn into branch-1.2. (Patrick Wendell, 2015-01-17, 2 files changed, -28/+0)
    This change was done in SPARK-4048 as part of a larger refactoring, but we need to backport this publishing of yarn and repl into Spark 1.2, so that we can cut a 1.2.1 release with these artifacts.
    Author: Patrick Wendell <patrick@databricks.com>
    Closes #4079 from pwendell/skip-deps and squashes the following commits: 807b833 [Patrick Wendell] [SPARK-5289]: Backport publishing of repl, yarn into branch-1.2.
* [SPARK-733] Add documentation on use of accumulators in lazy transformation (Ilya Ganelin, 2015-01-16, 1 file changed, -0/+28)
    I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.
    Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
    Closes #4022 from ilganeli/SPARK-733 and squashes the following commits: 587def5 [Ilya Ganelin] Updated to clarify verbage df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private" 3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private" 58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733 4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733 3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll 33b5a2d [Ilya Ganelin] Added code examples for java and python 1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue 5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private
    (cherry picked from commit fd3a8a1d15ad516ea056089e30d6fd14e2f2d9a1)
    Signed-off-by: Imran Rashid <irashid@cloudera.com>
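The pitfall being documented, as a minimal Scala sketch assuming a live SparkContext `sc` and the Spark 1.2-era accumulator API; this is illustrative, not the text added to the docs:

```scala
val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 10)

// The update happens inside a transformation, which is lazy ...
val mapped = data.map { x => acc += x; x }
println(acc.value)  // still 0: no action has run, so the map has not executed

// ... and only takes effect once an action forces evaluation.
mapped.count()
println(acc.value)  // 55, assuming the stage ran exactly once (retries can re-add)
```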
* [DOCS] Fix typo in return type of cogroup (Sean Owen, 2015-01-16, 1 file changed, -1/+1)
    This fixes a simple typo in the cogroup docs noted in http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAMAsSdJ8_24evMAMg7fOZCQjwimisbYWa9v8BN6Rc3JCauja6wmail.gmail.com%3E
    I didn't bother with a JIRA.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #4072 from srowen/CogroupDocFix and squashes the following commits: 43c850b [Sean Owen] Fix typo in return type of cogroup
    (cherry picked from commit f6b852aade7668c99f37c69f606c64763cb265d2)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-5201][CORE] deal with int overflow in the ParallelCollectionRDD.slice method (Ye Xianjin, 2015-01-16, 3 files changed, -16/+37)
    There is an int overflow in the ParallelCollectionRDD.slice method, originally reported by SaintBacchus:

    sc.makeRDD(1 to (Int.MaxValue)).count       // result = 0
    sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = Int.MaxValue - 1
    sc.makeRDD(1 until (Int.MaxValue)).count    // result = 2147483646 = Int.MaxValue - 1

    See https://github.com/apache/spark/pull/2874 for more details. This PR tries to fix the overflow. However, there's another issue it does not address:

    val largeRange = Int.MinValue to Int.MaxValue
    largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements.

    So the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is a limitation of Scala. We may want to support that kind of range, but the fix is beyond this PR. srowen andrewor14 would you mind taking a look at this PR?
    Author: Ye Xianjin <advancedxy@gmail.com>
    Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits: 96265a1 [Ye Xianjin] Update slice method comment and some responding docs. e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range. b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue. 7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case. 651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments 196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
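As an illustration of the overflow-safe approach (a hedged sketch; the actual helper in ParallelCollectionRDD may differ), slice boundaries can be computed in Long space and converted back to Int only at the end:

```scala
// Yields (start, end) index pairs for each slice; i * length is evaluated as a Long,
// so it cannot wrap even when length is close to Int.MaxValue.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}

positions(Int.MaxValue.toLong, 4).foreach(println)  // no negative or zero-width slices
```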
* [SPARK-4033][Examples] Input of the SparkPi too big causes the empty-collection exception (huangzhaowei, 2015-01-16, 1 file changed, -2/+2)
    If the input of the SparkPi args is larger than 25000, the integer 'n' inside the code will overflow and may become a negative number. That makes the (0 until n) Seq an empty seq, and the 'reduce' action then throws UnsupportedOperationException("empty collection"). The max size of the input of sc.parallelize is Int.MaxValue - 1, not Int.MaxValue.
    Author: huangzhaowei <carlmartinmax@gmail.com>
    Closes #2874 from SaintBacchus/SparkPi and squashes the following commits: 62d7cd7 [huangzhaowei] Add a commit to explain the modify 4cdc388 [huangzhaowei] Update SparkPi.scala 9a2fb7b [huangzhaowei] Input of the SparkPi is too big
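A hedged sketch of the guard (close to, but not necessarily identical to, the patched SparkPi), assuming `args` and a live SparkContext `sc` as in the example: compute the sample count in Long space and cap it so `100000 * slices` can never overflow Int:

```scala
val slices = if (args.length > 0) args(0).toInt else 2
// 100000L forces Long arithmetic; the cap keeps n a valid size for sc.parallelize.
val n = math.min(100000L * slices, Int.MaxValue).toInt
val count = sc.parallelize(1 until n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))
```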
* [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray (Davies Liu, 2015-01-15, 2 files changed, -1/+5)
    After the default batchSize changed to 0 (batched based on the size of objects), parallelize() still used BatchedSerializer with batchSize=1. This PR uses batchSize=1024 for parallelize by default. Also, BatchedSerializer did not work well with list and numpy.ndarray; this improves BatchedSerializer by using __len__ and __getslice__.
    Here is the benchmark for parallelizing 1 million ints with list or ndarray:

                      before    after    improvement
    list              11.7 s    0.8 s    14x
    numpy.ndarray     32 s      0.7 s    40x

    Author: Davies Liu <davies@databricks.com>
    Closes #4024 from davies/opt_numpy and squashes the following commits: 7618c7c [Davies Liu] improve performance of parallelize list/ndarray
    (cherry picked from commit 3c8650c12ad7a97852e7bd76153210493fd83e92)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-5254][MLLIB] remove developers section from spark.ml guide (Xiangrui Meng, 2015-01-14, 1 file changed, -14/+0)
    Forgot to remove this section in #4052.
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4053 from mengxr/SPARK-5254-update and squashes the following commits: f295bde [Xiangrui Meng] remove developers section from spark.ml guide
    (cherry picked from commit 6abc45e340d3be5f07236adc104db5f8dda0d514)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5254][MLLIB] Update the user guide to position spark.ml better (Xiangrui Meng, 2015-01-14, 2 files changed, -14/+21)
    The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated. First of all, the pipeline API is in its alpha stage and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use the spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable with using spark.mllib features and expect more coming. The user guide needs to be updated to make the message clear.
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4052 from mengxr/SPARK-5254 and squashes the following commits: 6d5f1d3 [Xiangrui Meng] typo 0cc935b [Xiangrui Meng] update user guide to position spark.ml better
    (cherry picked from commit 13d2406781714daea2bbf3bfb7fec0dead10760c)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5234][ml] examples for ml don't have sparkContext.stop (Yuhao Yang, 2015-01-14, 3 files changed, -0/+6)
| | | | | | | | | | | | | | | JIRA issue: https://issues.apache.org/jira/browse/SPARK-5234 simply add the call. Author: Yuhao Yang <yuhao@yuhaodevbox.sh.intel.com> Closes #4044 from hhbyyh/addscStop and squashes the following commits: c1f75ac [Yuhao Yang] add SparkContext.stop to 3 ml examples (cherry picked from commit 76389c5b99183e456ff85fd92ea68d95c4c13e82) Signed-off-by: Xiangrui Meng <meng@databricks.com>