Commit log (most recent first). Each entry: commit message (author, date; files changed, lines -/+).
* [SPARK-2165][YARN] Add support for setting maxAppAttempts in the ApplicationSubmissionContext (WangTaoTheTonic, 2015-01-07; 4 files, -3/+19)

  https://issues.apache.org/jira/browse/SPARK-2165
  I still have 2 questions:
  - If this config is not set, should we use YARN's corresponding value or a default value (like 2) on the Spark side?
  - Is this config name best, or should it be "spark.yarn.am.maxAttempts"?
  Author: WangTaoTheTonic <barneystinson@aliyun.com>
  Closes #3878 from WangTaoTheTonic/SPARK-2165 and squashes the following commits: 1416c83 [WangTaoTheTonic] use the name spark.yarn.maxAppAttempts 202ac85 [WangTaoTheTonic] rephrase some afdfc99 [WangTaoTheTonic] more detailed description 91562c6 [WangTaoTheTonic] add support for setting maxAppAttempts in the ApplicationSubmissionContext
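  The setting lands as an ordinary SparkConf entry. A minimal sketch of how a job might opt in (the value "2" is illustrative, not a recommended default):

  ```scala
  import org.apache.spark.SparkConf

  // Cap the number of YARN ApplicationMaster attempts for this application.
  val conf = new SparkConf().set("spark.yarn.maxAppAttempts", "2")
  ```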
* [YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA (huangzhaowei, 2015-01-07; 1 file, -1/+15)

  Currently, yarn-client will exit directly when an HA change happens, no matter how many times the AM should retry. The reason may be that the default final status only considered sys.exit, so yarn-client HA can't benefit from it. We should therefore distinguish the default final status between client and cluster mode, because the SUCCEEDED status may cause HA to fail in client mode, and UNDEFINED may trigger the error reporter in cluster mode when using sys.exit.
  Author: huangzhaowei <carlmartinmax@gmail.com>
  Closes #3771 from SaintBacchus/YarnHA and squashes the following commits: c02bfcc [huangzhaowei] Improve the comment of the funciton 'getDefaultFinalStatus' 0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA
* [SPARK-5099][MLlib] Simplify logistic loss function (Liang-Chi Hsieh, 2015-01-06; 1 file, -3/+9)

  This is a minor PR: we can simply take the negative of `margin` instead of subtracting `margin`. Mathematically they are equal, but the modified equation is the common form of the logistic loss function and so more readable. It also computes a more accurate value, as some quick tests show.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #3899 from viirya/logit_func and squashes the following commits: 91a3860 [Liang-Chi Hsieh] Modified for comment. 0aa51e4 [Liang-Chi Hsieh] Further simplified. 72a295e [Liang-Chi Hsieh] Revert LogLoss back and add more considerations in Logistic Loss. a3f83ca [Liang-Chi Hsieh] Fix a bug. 2bc5712 [Liang-Chi Hsieh] Simplify loss function.
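  The identity behind the simplification, written out for a margin $m$:

  $$\log(1 + e^{m}) - m \;=\; \log\frac{1 + e^{m}}{e^{m}} \;=\; \log(1 + e^{-m})$$

  Negating the margin and applying the same log1p-style form therefore yields the textbook logistic loss without the trailing subtraction, which also avoids loss of precision for large positive margins.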
* [SPARK-5050][MLlib] Add unit test for sqdist (Liang-Chi Hsieh, 2015-01-06; 2 files, -3/+33)

  Related to #3643. Follows the previous suggestion to add a unit test for `sqdist` in `VectorsSuite`.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #3869 from viirya/sqdist_test and squashes the following commits: fb743da [Liang-Chi Hsieh] Modified for comment and fix bug. 90a08f3 [Liang-Chi Hsieh] Modified for comment. 39a3ca6 [Liang-Chi Hsieh] Take care of special case. b789f42 [Liang-Chi Hsieh] More proper unit test with random sparsity pattern. c36be68 [Liang-Chi Hsieh] Add unit test for sqdist.
* SPARK-5017 [MLlib] Use SVD to compute determinant and inverse of covariance matrix (Travis Galoppo, 2015-01-06; 2 files, -11/+142)

  MultivariateGaussian was calling both pinv() and det() on the covariance matrix, effectively performing two matrix decompositions. Both values are now computed from a single singular value decomposition. The pseudo-inverse and the pseudo-determinant are used to guard against singular matrices.
  Author: Travis Galoppo <tjg2107@columbia.edu>
  Closes #3871 from tgaloppo/spark-5017 and squashes the following commits: 383b5b3 [Travis Galoppo] MultivariateGaussian - minor optimization in density calculation a5b8bc5 [Travis Galoppo] Added additional points to tests in test suite. Fixed comment in MultivariateGaussian 629d9d0 [Travis Galoppo] Moved some test values from var to val. dc3d0f7 [Travis Galoppo] Catch potential exception calculating pseudo-determinant. Style improvements. d448137 [Travis Galoppo] Added test suite for MultivariateGaussian, including test for degenerate case. 1989be0 [Travis Galoppo] SPARK-5017 - Fixed to use SVD to compute determinant and inverse of covariance matrix. Previous code called both pinv() and det(), effectively performing two matrix decompositions. Additionally, the pinv() implementation in Breeze is known to fail for singular matrices. b4415ea [Travis Galoppo] Merge branch 'spark-5017' of https://github.com/tgaloppo/spark into spark-5017 6f11b6d [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix. Code was calling both det() and pinv(), effectively performing two matrix decompositions. Futhermore, Breeze pinv() currently fails for singular matrices. fd9784c [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix
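  A sketch of the single-decomposition idea using Breeze (the 2x2 covariance and the tolerance rule are illustrative assumptions, not the PR's exact code):

  ```scala
  import breeze.linalg.{diag, svd, DenseMatrix, DenseVector}

  val cov = DenseMatrix((2.0, 1.0), (1.0, 2.0))
  val svd.SVD(u, s, vt) = svd(cov)                  // one decomposition serves both quantities
  val tol = 1e-9 * s.toArray.max                    // treat tiny singular values as zero
  val sInv = DenseVector(s.toArray.map(x => if (x > tol) 1.0 / x else 0.0))
  val pinv = vt.t * diag(sInv) * u.t                // pseudo-inverse
  val pseudoDet = s.toArray.filter(_ > tol).product // pseudo-determinant
  ```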
* SPARK-4159 [CORE] Maven build doesn't run JUnit test suites (Sean Owen, 2015-01-06; 39 files, -277/+70)

  This PR:
  - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
  - Tells `surefire` to test only Java tests
  - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.
  For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #3651 from srowen/SPARK-4159 and squashes the following commits: 2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete 12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit. e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
* [Minor] Fix comments for GraphX 2D partitioning strategy (kj-ki, 2015-01-06; 1 file, -3/+3)

  The number of vertices in the example matrix (v0 to v11) is 12, and I think the same block overlaps in this strategy. This is a minor PR, so I didn't file a JIRA issue.
  Author: kj-ki <kikushima.kenji@lab.ntt.co.jp>
  Closes #3904 from kj-ki/fix-partitionstrategy-comments and squashes the following commits: 79829d9 [kj-ki] Fix comments for 2D partitioning.
* [SPARK-1600] Refactor FileInputStream tests to remove Thread.sleep() calls and SystemClock usage (Josh Rosen, 2015-01-06; 6 files, -136/+251)

  This patch refactors Spark Streaming's FileInputStream tests to remove uses of Thread.sleep() and SystemClock, which should hopefully resolve some longstanding flakiness in these tests (see SPARK-1600). Key changes:
  - Modify FileInputDStream to use the scheduler's Clock instead of System.currentTimeMillis(); this allows it to be tested using ManualClock.
  - Fix a synchronization issue in ManualClock's `currentTime` method.
  - Add a StreamingTestWaiter class which allows callers to block until a certain number of batches have finished.
  - Change the FileInputStream tests so that files' modification times are manually set based off of ManualClock; this eliminates many Thread.sleep calls.
  - Update these tests to use the withStreamingContext fixture.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3801 from JoshRosen/SPARK-1600 and squashes the following commits: e4494f4 [Josh Rosen] Address a potential race when setting file modification times 8340bd0 [Josh Rosen] Use set comparisons for output. 0b9c252 [Josh Rosen] Fix some ManualClock usage problems. 1cc689f [Josh Rosen] ConcurrentHashMap -> SynchronizedMap db26c3a [Josh Rosen] Use standard timeout in ScalaTest `eventually` blocks. 3939432 [Josh Rosen] Rename StreamingTestWaiter to BatchCounter 0b9c3a1 [Josh Rosen] Wait for checkpoint to complete 863d71a [Josh Rosen] Remove Thread.sleep that was used to make task run slowly b4442c3 [Josh Rosen] batchTimeToSelectedFiles should be thread-safe 15b48ee [Josh Rosen] Replace several TestWaiter methods w/ ScalaTest eventually. fffc51c [Josh Rosen] Revert "Remove last remaining sleep() call" dbb8247 [Josh Rosen] Remove last remaining sleep() call 566a63f [Josh Rosen] Fix log message and comment typos da32f3f [Josh Rosen] Fix log message and comment typos 3689214 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-1600 c8f06b1 [Josh Rosen] Remove Thread.sleep calls in FileInputStream CheckpointSuite test. d4f2d87 [Josh Rosen] Refactor file input stream tests to not rely on SystemClock. dda1403 [Josh Rosen] Add StreamingTestWaiter class. 3c3efc3 [Josh Rosen] Synchronize `currentTime` in ManualClock a95ddc4 [Josh Rosen] Modify FileInputDStream to use Clock class.
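  The synchronization fix in the second bullet above boils down to guarding the clock's state with a monitor. A simplified sketch (illustrative names, not Spark's actual class):

  ```scala
  class ManualClock {
    private var time = 0L

    def currentTime(): Long = synchronized { time }

    def advance(ms: Long): Unit = synchronized {
      time += ms
      notifyAll() // wake anyone blocked in waitTillTime
    }

    def waitTillTime(target: Long): Long = synchronized {
      while (time < target) wait(100) // re-check the condition after every wakeup
      time
    }
  }
  ```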
* SPARK-4843 [YARN] Squash ExecutorRunnableUtil and ExecutorRunnable (Kostas Sakellis, 2015-01-05; 2 files, -213/+172)

  ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, this commit squashes the unnecessary hierarchy. The methods from ExecutorRunnableUtil are added as private.
  Author: Kostas Sakellis <kostas@cloudera.com>
  Closes #3696 from ksakellis/kostas-spark-4843 and squashes the following commits: 486716f [Kostas Sakellis] Moved prepareEnvironment call to after yarnConf declaration 470e22e [Kostas Sakellis] Fixed indentation and renamed sparkConf variable 9b1b1c9 [Kostas Sakellis] SPARK-4843 [YARN] Squash ExecutorRunnableUtil and ExecutorRunnable
* [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL (Reynold Xin, 2015-01-05; 2 files, -0/+21)

  Author: Reynold Xin <rxin@databricks.com>
  Closes #3862 from rxin/stringcontext-attr and squashes the following commits: 9b10f57 [Reynold Xin] Rename StrongToAttributeConversionHelper 72121af [Reynold Xin] [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.
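  The notation is driven by a StringContext extension method. A self-contained sketch of the mechanism (`UnresolvedAttr` is a stand-in for Spark's unresolved-attribute class, not its real name):

  ```scala
  object Dsl {
    case class UnresolvedAttr(name: String)

    implicit class StringToAttributeConversionHelper(sc: StringContext) {
      def $(args: Any*): UnresolvedAttr = UnresolvedAttr(sc.s(args: _*))
    }
  }

  import Dsl._
  val col = $"attribute name" // UnresolvedAttr("attribute name")
  ```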
* [SPARK-5093] Set spark.network.timeout to 120s consistently (Reynold Xin, 2015-01-05; 5 files, -13/+8)

  Author: Reynold Xin <rxin@databricks.com>
  Closes #3903 from rxin/timeout-120 and squashes the following commits: 7c2138e [Reynold Xin] [SPARK-5093] Set spark.network.timeout to 120s consistently.
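  In application code this is an ordinary conf entry. A sketch (at the time of this commit the value was a plain number of seconds):

  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.network.timeout", "120") // seconds
  ```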
* [SPARK-5089][PYSPARK][MLLIB] Fix vector convert (freeman, 2015-01-05; 2 files, -1/+11)

  This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64, but currently this wasn't actually happening. As a result, non-float64 arrays would be silently parsed incorrectly during SerDe, yielding erroneous results when running, for example, KMeans. The PR includes the fix, as well as a new test for the correct conversion behavior. davies
  Author: freeman <the.freeman.lab@gmail.com>
  Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits: 764db47 [freeman] Add a test for proper conversion behavior 704f97e [freeman] Return array after changing type
* [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all (Jongyoul Lee, 2015-01-05; 1 file, -8/+7)

  Fixed the scope of runAsSparkUser, moving it from MesosExecutorDriver.run to MesosExecutorBackend.launchTask. See the JIRA issue for more details.
  Author: Jongyoul Lee <jongyoul@gmail.com>
  Closes #3741 from jongyoul/SPARK-4465 and squashes the following commits: 46ad71e [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed unused import 3d6631f [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed comments and adjusted indentations 2343f13 [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask
* [SPARK-5057] Log message in failed askWithReply attempts (WangTao, 2015-01-05; 1 file, -7/+7)

  https://issues.apache.org/jira/browse/SPARK-5057
  Author: WangTao <barneystinson@aliyun.com>
  Author: WangTaoTheTonic <barneystinson@aliyun.com>
  Closes #3875 from WangTaoTheTonic/SPARK-5057 and squashes the following commits: 1503487 [WangTao] use string interpolation 706c8a7 [WangTaoTheTonic] log more messages
* [SPARK-4688] Have a single shared network timeout in Spark (Varun Saxena, 2015-01-05; 5 files, -5/+21)

  Author: Varun Saxena <vsaxena.varun@gmail.com>
  Author: varunsaxena <vsaxena.varun@gmail.com>
  Closes #3562 from varunsaxena/SPARK-4688 and squashes the following commits: 6e97f72 [Varun Saxena] [SPARK-4688] Single shared network timeout cd783a2 [Varun Saxena] SPARK-4688 d6f8c29 [Varun Saxena] SCALA-4688 9562b15 [Varun Saxena] SPARK-4688 a75f014 [varunsaxena] SPARK-4688 594226c [varunsaxena] SPARK-4688
* [SPARK-5074][Core] Fix a non-deterministic test failure (zsxwing, 2015-01-04; 1 file, -0/+1)

  Add `assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS))` to make sure `sparkListener` receives the message.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3889 from zsxwing/SPARK-5074 and squashes the following commits: e61c198 [zsxwing] Fix a non-deterministic test failure
* [SPARK-5083][Core] Fix a flaky test in TaskResultGetterSuite (zsxwing, 2015-01-04; 1 file, -2/+20)

  Because `sparkEnv.blockManager.master.removeBlock` is asynchronous, we need to make sure the block has already been removed before calling `super.enqueueSuccessfulTask`.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3894 from zsxwing/SPARK-5083 and squashes the following commits: d97c03d [zsxwing] Fix a flaky test in TaskResultGetterSuite
* [SPARK-5069][Core] Fix the race condition of TaskSchedulerImpl.dagScheduler (zsxwing, 2015-01-04; 2 files, -7/+1)

  It's not necessary to set `TaskSchedulerImpl.dagScheduler` in preStart. It's safe to set it after `initializeEventProcessActor()`.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3887 from zsxwing/SPARK-5069 and squashes the following commits: d95894f [zsxwing] Fix the race condition of TaskSchedulerImpl.dagScheduler
* [SPARK-5067][Core] Use '===' to compare well-defined case classes (zsxwing, 2015-01-04; 1 file, -28/+4)

  A simple fix would be adding `assert(e1.appId == e2.appId)` for `SparkListenerApplicationStart`. But actually we can use `===` for well-defined case classes directly. Therefore, instead of fixing this issue field by field, I use `===` to compare those well-defined case classes (ones where all fields have implemented a correct `equals` method, such as primitive types).
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3886 from zsxwing/SPARK-5067 and squashes the following commits: 0a51711 [zsxwing] Use '===' to compare well-defined case class
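  Why one assertion suffices: case-class equality is structural, so ScalaTest's `===` compares every field at once and reports a readable diff on failure. A minimal sketch with an illustrative event class:

  ```scala
  import org.scalatest.Assertions._

  case class AppStart(appId: String, time: Long, user: String)

  val e1 = AppStart("app-1", 100L, "alice")
  val e2 = AppStart("app-1", 100L, "alice")
  assert(e1 === e2) // replaces one assert per field
  ```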
* [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs (Josh Rosen, 2015-01-04; 6 files, -7/+75)

  This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.
  Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
  In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.
  This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits: 36eaf35 [Josh Rosen] Add comment explaining use of transform() in test. 6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform() 7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming. e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic. 762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
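  The DynamicVariable pattern in miniature (the variable and function names here are illustrative, not Spark's internals):

  ```scala
  import scala.util.DynamicVariable

  val disableOutputSpecValidation = new DynamicVariable[Boolean](false)

  def saveOutput(): Unit =
    println(s"validate output specs: ${!disableOutputSpecValidation.value}")

  saveOutput()                                 // validate output specs: true
  disableOutputSpecValidation.withValue(true) {
    saveOutput()                               // validate output specs: false, only in this scope
  }
  ```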
* [SPARK-4631] Unit test for MQTT (bilna, 2015-01-04; 2 files, -15/+101)

  Please review the unit test for MQTT.
  Author: bilna <bilnap@am.amrita.edu>
  Author: Bilna P <bilna.p@gmail.com>
  Closes #3844 from Bilna/master and squashes the following commits: acea3a3 [bilna] Adding dependency with scope test 28681fa [bilna] Merge remote-tracking branch 'upstream/master' fac3904 [bilna] Correction in Indentation and coding style ed9db4c [bilna] Merge remote-tracking branch 'upstream/master' 4b34ee7 [Bilna P] Update MQTTStreamSuite.scala 04503cf [bilna] Added embedded broker service for mqtt test 89d804e [bilna] Merge remote-tracking branch 'upstream/master' fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master' 4b58094 [Bilna P] Update MQTTStreamSuite.scala b1ac4ad [bilna] Added BeforeAndAfter 5f6bfd2 [bilna] Added BeforeAndAfter e8b6623 [Bilna P] Update MQTTStreamSuite.scala 5ca6691 [Bilna P] Update MQTTStreamSuite.scala 8616495 [bilna] [SPARK-4631] unit test for MQTT
* [SPARK-4787] Stop SparkContext if a DAGScheduler init error occurs (Dale, 2015-01-04; 1 file, -2/+7)

  Author: Dale <tigerquoll@outlook.com>
  Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits: 5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block. 2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization.
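  The shape of the fix, as a self-contained sketch with hypothetical init/stop hooks: the finally block guarantees the original exception is rethrown even if cleanup itself fails.

  ```scala
  def initOrStop(init: () => Unit, stop: () => Unit): Unit =
    try init() catch {
      case e: Exception =>
        try stop()      // best-effort cleanup of the half-constructed context
        finally throw e // preserve the original failure
    }
  ```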
* [SPARK-794][Core] Remove sleep() in ClusterScheduler.stop (Brennon York, 2015-01-04; 1 file, -3/+0)

  Removed `sleep()` from the `stop()` method of the `TaskSchedulerImpl` class which, from the JIRA ticket, is believed to be a legacy artifact slowing down testing originally introduced in the `ClusterScheduler` class.
  Author: Brennon York <brennon.york@capitalone.com>
  Closes #3851 from brennonyork/SPARK-794 and squashes the following commits: 04c3e64 [Brennon York] Removed sleep() from the stop() method
* [SPARK-5058] Updated broken links (sigmoidanalytics, 2015-01-03; 1 file, -1/+1)

  Updated the broken link pointing to the KafkaWordCount example to the correct one.
  Author: sigmoidanalytics <mayur@sigmoidanalytics.com>
  Closes #3877 from sigmoidanalytics/patch-1 and squashes the following commits: 3e19b31 [sigmoidanalytics] Updated broken links
* Fixed typos in streaming-kafka-integration.md (Akhil Das, 2015-01-02; 1 file, -1/+1)

  Changed projrect to project :)
  Author: Akhil Das <akhld@darktech.ca>
  Closes #3876 from akhld/patch-1 and squashes the following commits: e0cf9ef [Akhil Das] Fixed typos in streaming-kafka-integration.md
* [SPARK-3325][Streaming] Add a parameter to the method print in class DStream (Yadong Qi, 2015-01-02; 4 files, -9/+30)

  This PR is a fixed version of the original PR #3237 by watermen and scwf. It adds the ability to specify how many elements to print in `DStream.print`.
  Author: Yadong Qi <qiyadong2010@gmail.com>
  Author: q00251598 <qiyadong@huawei.com>
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Author: wangfei <wangfei1@huawei.com>
  Closes #3865 from tdas/print-num and squashes the following commits: cd34e9e [Tathagata Das] Fix bug 7c09f16 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into HEAD bb35d1a [Yadong Qi] Update MimaExcludes.scala f8098ca [Yadong Qi] Update MimaExcludes.scala f6ac3cb [Yadong Qi] Update MimaExcludes.scala e4ed897 [Yadong Qi] Update MimaExcludes.scala 3b9d5cf [wangfei] fix conflicts ec8a3af [q00251598] move to Spark 1.3 26a70c0 [q00251598] extend the Python DStream's print b589a4b [q00251598] add another print function
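  A hedged usage sketch of the new overload, assuming an existing SparkContext `sc`:

  ```scala
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(1))
  // Show the first 20 elements of each batch; the no-arg print() keeps its old default of 10.
  ssc.socketTextStream("localhost", 9999).print(20)
  ```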
* [HOTFIX] Bind web UI to ephemeral port in DriverSuite (Josh Rosen, 2015-01-01; 1 file, -1/+4)

  The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously. Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests. In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs. See also: e24d3a9
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits: 48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite.
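  The ephemeral-port trick is just a conf entry; a sketch:

  ```scala
  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.ui.port", "0") // 0 = let the OS pick a free port
  ```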
* [SPARK-5038] Add explicit return type for implicit functions (Reynold Xin, 2014-12-31; 6 files, -63/+64)

  As we learned in #3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior. This is a follow-up PR for the rest of Spark (outside Spark SQL). The original PR for Spark SQL can be found at https://github.com/apache/spark/pull/3859
  Author: Reynold Xin <rxin@databricks.com>
  Closes #3860 from rxin/implicit and squashes the following commits: 73702f9 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions.
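  A minimal illustration of the hazard (a REPL-style sketch, not Spark code): with an inferred result type, an edit to the conversion's body can silently change the public API, while an explicit type pins it down.

  ```scala
  import scala.language.implicitConversions

  class RichSeq[A](val s: Seq[A]) { def second: A = s(1) }

  // Risky: the result type is whatever the body happens to produce.
  // implicit def toRich[A](s: Seq[A]) = new RichSeq(s)

  // Preferred: the conversion's signature is explicit.
  implicit def toRich[A](s: Seq[A]): RichSeq[A] = new RichSeq(s)

  Seq(1, 2, 3).second // 2
  ```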
* SPARK-2757 [BUILD] [STREAMING] Add Mima test for Spark Sink after 1.1.0 is released (Sean Owen, 2014-12-31; 2 files, -1/+6)

  Re-enable MiMa for the Streaming Flume Sink module, now that 1.1.0 is released, per the JIRA TO-DO. That's pretty much all there is to this.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #3842 from srowen/SPARK-2757 and squashes the following commits: 50ff80e [Sean Owen] Exclude apparent false positive turned up by re-enabling MiMa checks for Streaming Flume Sink 0e5ba5c [Sean Owen] Re-enable MiMa for Streaming Flume Sink module
* [SPARK-5035] [Streaming] ReceiverMessage trait should extend Serializable (Josh Rosen, 2014-12-31; 1 file, -1/+1)

  Spark Streaming's ReceiverMessage trait should extend Serializable in order to fix a subtle bug that only occurs when running on a real cluster: if you attempt to send a fire-and-forget message to a remote Akka actor and that message cannot be serialized, then this seems to lead to more-or-less silent failures. As an optimization, Akka skips message serialization for messages sent within the same JVM. As a result, Spark's unit tests will never fail due to non-serializable Akka messages, but these will cause mostly-silent failures when running on a real cluster.
  Before this patch, here was the code for ReceiverMessage:
  ```
  /** Messages sent to the NetworkReceiver. */
  private[streaming] sealed trait ReceiverMessage
  private[streaming] object StopReceiver extends ReceiverMessage
  ```
  Since ReceiverMessage does not extend Serializable and StopReceiver is a regular `object`, not a `case object`, StopReceiver will throw serialization errors. As a result, graceful receiver shutdown is broken on real clusters (and local-cluster mode) but works in local modes. If you want to reproduce this, try running the word count example from the Streaming Programming Guide in the Spark shell:
  ```
  import org.apache.spark._
  import org.apache.spark.streaming._
  import org.apache.spark.streaming.StreamingContext._

  val ssc = new StreamingContext(sc, Seconds(10))
  // Create a DStream that will connect to hostname:port, like localhost:9999
  val lines = ssc.socketTextStream("localhost", 9999)
  // Split each line into words
  val words = lines.flatMap(_.split(" "))
  // Count each word in each batch
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)
  // Print the first ten elements of each RDD generated in this DStream to the console
  wordCounts.print()
  ssc.start()
  Thread.sleep(10000)
  ssc.stop(true, true)
  ```
  Prior to this patch, this would work correctly in local mode but fail when running against a real cluster (it would report that some receivers were not shut down).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3857 from JoshRosen/SPARK-5035 and squashes the following commits: 71d0eae [Josh Rosen] [SPARK-5035] ReceiverMessage trait should extend Serializable.
* SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an RDD only (Travis Galoppo, 2014-12-31; 1 file, -7/+2)

  Removed unnecessary parameters from predictMembership(). CC: jkbradley
  Author: Travis Galoppo <tjg2107@columbia.edu>
  Closes #3854 from tgaloppo/spark-5020 and squashes the following commits: 1bf4669 [Travis Galoppo] renamed predictMembership() to predictSoft() 0f1d96e [Travis Galoppo] SPARK-5020 - Removed superfluous parameters from predictMembership()
* [SPARK-5028][Streaming] Add total received and processed records metrics to Streaming UI (jerryshao, 2014-12-31; 1 file, -0/+6)

  This is a follow-up to [SPARK-4537](https://issues.apache.org/jira/browse/SPARK-4537), adding total received records and processed records metrics back to the UI. ![screenshot](https://dl.dropboxusercontent.com/u/19230832/screenshot.png)
  Author: jerryshao <saisai.shao@intel.com>
  Closes #3852 from jerryshao/SPARK-5028 and squashes the following commits: c8c4877 [jerryshao] Add total received and processed metrics to Streaming UI
* [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing (Hari Shreedharan, 2014-12-31; 7 files, -16/+42)

  Since the deletes happen asynchronously, the getFileStatus call might throw an exception in older HDFS versions if the delete happens between the time listFiles is called on the directory and getFileStatus is called on the file in the getFileStatus method. This PR addresses this by adding an option to delete the files synchronously and then waiting for the deletion to complete before proceeding.
  Author: Hari Shreedharan <hshreedharan@apache.org>
  Closes #3726 from harishreedharan/spark-4790 and squashes the following commits: bbbacd1 [Hari Shreedharan] Call cleanUpOldLogs only once in the tests. 3255f17 [Hari Shreedharan] Add test for async deletion. Remove method from ReceiverTracker that does not take waitForCompletion. e4c83ec [Hari Shreedharan] Making waitForCompletion a mandatory param. Remove eventually from WALSuite since the cleanup method returns only after all files are deleted. af00fd1 [Hari Shreedharan] [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing.
* [SPARK-5038][SQL] Add explicit return type for implicit functions in Spark SQL (Reynold Xin, 2014-12-31; 2 files, -41/+41)

  As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #3859 from rxin/sql-implicits and squashes the following commits: 30c2c24 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions in Spark SQL.
* [HOTFIX] Disable Spark UI in SparkSubmitSuite tests (Josh Rosen, 2014-12-31; 1 file, -0/+2)

  This should fix a major cause of build breaks when running many parallel tests.
* SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics (Sean Owen, 2014-12-31; 2 files, -3/+92)

  Now that I've implemented the basics here, I'm less convinced there is a need for this change, somehow. Callers can downsample before or after. Really the OOM is not in the ROC curve code, but in code that might `collect()` it for local analysis. Still, might be useful to down-sample since the ROC curve probably never needs millions of points.
  This is a first pass. Since the `(score,label)` are already grouped and sorted, I think it's sufficient to just take every Nth such pair, in order to downsample by a factor of N? This is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve of course. What do you think about the API, and usefulness?
  Author: Sean Owen <sowen@cloudera.com>
  Closes #3702 from srowen/SPARK-4547 and squashes the following commits: 1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc 692d825 [Sean Owen] Change handling of large numBins, make 2nd consturctor instead of optional param, style change a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
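  A hedged usage sketch of the resulting knob, assuming an existing SparkContext `sc`; per the squashed commits, the bin count ended up as a second constructor argument:

  ```scala
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, 100) // at most ~100 curve points
  metrics.roc().collect()
  ```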
* [SPARK-4298][Core] spark-submit cannot read Main-Class from Manifest (Brennon York, 2014-12-31; 1 file, -8/+18)

  Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object.
  Author: Brennon York <brennon.york@capitalone.com>
  Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits: 5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment 14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class 8d20936 [Brennon York] updated to reset the error message back to the default a043039 [Brennon York] updated to split the uri and jar vals 8da7cbf [Brennon York] fixes SPARK-4298
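  The underlying pattern, sketched with an illustrative path: normalize the URI to a filesystem path before opening it as a JarFile, then read Main-Class from the manifest.

  ```scala
  import java.net.URI
  import java.util.jar.JarFile

  val uri = new URI("file:/path/to/app.jar")   // illustrative
  val jar = new JarFile(uri.getPath)           // .getPath, not the raw URI string
  val mainClass = Option(jar.getManifest).map(_.getMainAttributes.getValue("Main-Class"))
  ```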
* [SPARK-4797] Replace breezeSquaredDistance (Liang-Chi Hsieh, 2014-12-31; 3 files, -8/+100)

  This PR replaces the slow breezeSquaredDistance.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #3643 from viirya/faster_squareddistance and squashes the following commits: f28b275 [Liang-Chi Hsieh] Move the implementation to linalg.Vectors and rename as sqdist. 0bc48ee [Liang-Chi Hsieh] Merge branch 'master' into faster_squareddistance ba34422 [Liang-Chi Hsieh] Fix bug. 91849d0 [Liang-Chi Hsieh] Modified for comment. 44a65ad [Liang-Chi Hsieh] Modified for comments. 35db395 [Liang-Chi Hsieh] Fix bug and some modifications for comments. f4f5ebb [Liang-Chi Hsieh] Follow BLAS.dot pattern to replace intersect, diff with while-loop. a36e09f [Liang-Chi Hsieh] Use while-loop to replace foreach for better performance. d3e0628 [Liang-Chi Hsieh] Make the methods private. dd415bc [Liang-Chi Hsieh] Consider different cases of SparseVector and DenseVector. 13669db [Liang-Chi Hsieh] Replace breezeSquaredDistance.
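  The core of the replacement is a merge-style while loop over two sorted sparse index arrays, per the squashed commits. A self-contained sketch of that technique (not Spark's exact code):

  ```scala
  def sqdistSparse(ia: Array[Int], va: Array[Double],
                   ib: Array[Int], vb: Array[Double]): Double = {
    var i = 0; var j = 0; var d = 0.0
    while (i < ia.length && j < ib.length) {
      if (ia(i) == ib(j)) { val diff = va(i) - vb(j); d += diff * diff; i += 1; j += 1 }
      else if (ia(i) < ib(j)) { d += va(i) * va(i); i += 1 } // index only in a
      else { d += vb(j) * vb(j); j += 1 }                    // index only in b
    }
    while (i < ia.length) { d += va(i) * va(i); i += 1 }
    while (j < ib.length) { d += vb(j) * vb(j); j += 1 }
    d
  }

  // Vectors (1, 0, 2) and (0, 3, 2): 1 + 9 + 0 = 10
  assert(sqdistSparse(Array(0, 2), Array(1.0, 2.0), Array(1, 2), Array(3.0, 2.0)) == 10.0)
  ```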
* [SPARK-1010] Clean up uses of System.setProperty in unit tests (Josh Rosen, 2014-12-30; 23 files, -232/+216)

  Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.
  This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself). For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and automatically restores them on test completion / failure. See the block comment at the top of the ResetSystemProperties class for more details.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits: 0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools 3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext 4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties 4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering. 0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite. 7a3d224 [Josh Rosen] Fix trait ordering 3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite 655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite 3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite 8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait. 633a84a [Josh Rosen] Remove use of system properties in FileServerSuite 25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite 1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite 5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite 0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite 51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite 60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite 14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite 628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite 9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite. 4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
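  A minimal sketch of the shape of such a mixin (the real trait lives in Spark's test utilities; this is an approximation):

  ```scala
  import java.util.Properties
  import org.scalatest.{BeforeAndAfterEach, Suite}

  trait ResetSystemProperties extends BeforeAndAfterEach { this: Suite =>
    private var oldProperties: Properties = _

    override def beforeEach(): Unit = {
      oldProperties = new Properties()
      oldProperties.putAll(System.getProperties) // snapshot before the test runs
      super.beforeEach()
    }

    override def afterEach(): Unit = {
      try super.afterEach()
      finally System.setProperties(oldProperties) // restore even if the test failed
    }
  }
  ```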
* [SPARK-4998][MLlib] Delete the "train" function (Liu Jiongzhou, 2014-12-30; 1 file, -7/+0)

  This makes the functions with the same name in the companion object effective, especially when using Java reflection, since the "train" function defined in "class DecisionTree" will hide the functions with the same name in "object DecisionTree". JIRA: [SPARK-4998]
  Author: Liu Jiongzhou <ljzzju@163.com>
  Closes #3836 from ljzzju/master and squashes the following commits: 4e13133 [Liu Jiongzhou] [MLlib]delete the "train" function
* [SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' (zsxwing, 2014-12-30; 1 file, -15/+48)

  Used `Condition` to rewrite `ContextWaiter` because it provides a convenient API, `awaitNanos`, for timeouts.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3661 from zsxwing/SPARK-4813 and squashes the following commits: 52247f5 [zsxwing] Add explicit unit type be42bcf [zsxwing] Update as per review suggestion e06bd4f [zsxwing] Fix the issue that ContextWaiter didn't handle 'spurious wakeup'
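  The spurious-wakeup-safe idiom in miniature: `awaitNanos` returns the remaining wait time, and the while loop re-checks the predicate after every wakeup. A self-contained sketch (illustrative class, not Spark's ContextWaiter):

  ```scala
  import java.util.concurrent.TimeUnit
  import java.util.concurrent.locks.ReentrantLock

  class Waiter {
    private val lock = new ReentrantLock()
    private val done = lock.newCondition()
    private var stopped = false

    def notifyStop(): Unit = {
      lock.lock()
      try { stopped = true; done.signalAll() } finally lock.unlock()
    }

    def waitForStop(timeoutMs: Long): Boolean = {
      lock.lock()
      try {
        var nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
        while (!stopped && nanos > 0) nanos = done.awaitNanos(nanos) // loop absorbs spurious wakeups
        stopped
      } finally lock.unlock()
    }
  }
  ```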
* [SPARK-4995] Replace Vector.toBreeze.activeIterator with foreachActive (Jakub Dubovsky, 2014-12-30; 3 files, -6/+8)

  The new foreachActive method on Vector was introduced by SPARK-4431 as a more efficient alternative to vector.toBreeze.activeIterator. There are some parts of the codebase where it was not yet replaced. dbtsai
  Author: Jakub Dubovsky <dubovsky@avast.com>
  Closes #3846 from james64/SPARK-4995-foreachActive and squashes the following commits: 3eb7e37 [Jakub Dubovsky] Scalastyle fix 32fe6c6 [Jakub Dubovsky] activeIterator removed - IndexedRowMatrix.toBreeze 47a4777 [Jakub Dubovsky] activeIterator removed in RowMatrix.toBreeze 90a7d98 [Jakub Dubovsky] activeIterator removed in MLUtils.saveAsLibSVMFile
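  The replacement pattern, as a small usage sketch: iterate only the active (index, value) pairs of an MLlib vector, with no Breeze conversion.

  ```scala
  import org.apache.spark.mllib.linalg.Vectors

  val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
  var sum = 0.0
  v.foreachActive { (i, value) => sum += value } // visits only (1, 2.0) and (3, 4.0)
  ```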
* SPARK-3955 part 2 [CORE] [HOTFIX] Different versions between jackson-mapper-asl and jackson-core-asl (Sean Owen, 2014-12-30; 1 file, -1/+1)

  pwendell: https://github.com/apache/spark/commit/2483c1efb6429a7d8a20c96d18ce2fec93a1aff9 didn't actually add a reference to `jackson-core-asl` as intended, but a second redundant reference to `jackson-mapper-asl`, as markhamstra picked up on (https://github.com/apache/spark/pull/3716#issuecomment-68180192). This just rectifies the typo. I missed it as well; the original PR https://github.com/apache/spark/pull/2818 had it correct and I also didn't see the problem.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #3829 from srowen/SPARK-3955 and squashes the following commits: 6cfdc4e [Sean Owen] Actually refer to jackson-core-asl
* [SPARK-4570][SQL] Add BroadcastLeftSemiJoinHash (wangxiaojing, 2014-12-30; 4 files, -1/+160)

  JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
  We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semi join`. In a left semi join, if the size of the data from the right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`, the planner marks it as the `broadcast` relation and marks the other relation as the stream side. The broadcast table is broadcast to all of the executors involved in the join, as an `org.apache.spark.broadcast.Broadcast` object, and the plan uses `joins.BroadcastLeftSemiJoinHash`; otherwise it uses `joins.LeftSemiJoinHash`.
  The benchmark suggests this made the optimized version 4x faster for `left semi join`:
  ```
  Original:  left semi join : 9288 ms
  Optimized: left semi join : 1963 ms
  ```
  The micro benchmark loads `data1/kv3.txt` into a normal Hive table. Benchmark code:
  ```
  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }

  val sc = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))
  val hiveContext = new HiveContext(sc)
  import hiveContext._

  sql("drop table if exists left_table")
  sql("drop table if exists right_table")
  sql("""create table left_table (key int, value string)""".stripMargin)
  sql(s"""load data local inpath "/data1/kv3.txt" into table left_table""")
  sql("""create table right_table (key int, value string)""".stripMargin)
  sql("""
    |from left_table
    |insert overwrite table right_table
    |select left_table.key, left_table.value
  """.stripMargin)

  val leftSemiJoin = sql(
    """select a.key from left_table a
      |left semi join right_table b on a.key = b.key""".stripMargin)
  val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
  println(s"left semi join : $leftSemiJoinDuration ms ")
  ```
  Author: wangxiaojing <u9jing@gmail.com>
  Closes #3442 from wangxiaojing/SPARK-4570 and squashes the following commits: a4a43c9 [wangxiaojing] rebase f103983 [wangxiaojing] change style fbe4887 [wangxiaojing] change style ff2e618 [wangxiaojing] add testsuite 1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
* [SPARK-4935][SQL] When hive.cli.print.header is configured, spark-sql aborts if passed an invalid SQL statement (wangfei, 2014-12-30; 1 file, -1/+1)

  If we pass in a wrong SQL statement like ```abdcdfsfs```, the spark-sql script aborts.
  Author: wangfei <wangfei1@huawei.com>
  Author: Fei Wang <wangfei1@huawei.com>
  Closes #3761 from scwf/patch-10 and squashes the following commits: 46dc344 [Fei Wang] revert console.printError(rc.getErrorMessage()) 0330e07 [wangfei] avoid to print error message repeatedly 1614a11 [wangfei] spark-sql abort when passed in a wrong sql
* [SPARK-4386] Improve performance when writing Parquet files (Michael Davies, 2014-12-30; 1 file, -2/+2)

  Convert the type of RowWriteSupport.attributes to Array. Analysis of performance for writing very wide tables shows that time is spent predominantly in the apply method on the attributes var. The type of attributes was previously LinearSeqOptimized, whose apply is O(N), which made writes O(N squared). Measurements on a 575-column table showed this change made a 6x improvement in write times.
  Author: Michael Davies <Michael.BellDavies@gmail.com>
  Closes #3843 from MickDavies/SPARK-4386 and squashes the following commits: 892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files
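  Why the Array conversion matters, as a sketch: `List.apply(i)` walks i nodes, so touching every column of a row by index is O(N^2) per row, while `Array.apply(i)` is O(1).

  ```scala
  val attributes: List[String] = List.fill(575)("col") // stand-in for the wide schema

  // O(N^2) per row: each attributes(i) traverses the list from the head.
  def rowCostList(): Int = {
    var i = 0; var n = 0
    while (i < attributes.length) { n += attributes(i).length; i += 1 }
    n
  }

  val attrArray = attributes.toArray // convert once, outside the per-row loop

  // O(N) per row: constant-time indexing.
  def rowCostArray(): Int = {
    var i = 0; var n = 0
    while (i < attrArray.length) { n += attrArray(i).length; i += 1 }
    n
  }
  ```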
* [SPARK-4937][SQL] Normalizes conjunctions and disjunctions to eliminate common predicates (Cheng Lian, 2014-12-30; 4 files, -8/+110)

  This PR is a simplified version of several filter optimization rules introduced in #3778 authored by scwf. Newly introduced optimizations include:
  1. `a && a` => `a`
  2. `a || a` => `a`
  3. `(a && b && c && ...) || (a && b && d && ...)` => `a && b && (c || d || ...)`
  The 3rd rule is particularly useful for optimizing the following query, which is planned into a cartesian product:
  ```sql
  SELECT * FROM t1, t2
  WHERE (t1.key = t2.key AND t1.value > 10)
     OR (t1.key = t2.key AND t2.value < 20)
  ```
  to the following one, which is planned into an equi-join:
  ```sql
  SELECT * FROM t1, t2
  WHERE t1.key = t2.key
    AND (t1.value > 10 OR t2.value < 20)
  ```
  The example above is quite artificial, but common predicates are likely to appear in real life complex queries (like the one mentioned in #3778). A difference between this PR and #3778 is that these optimizations are not limited to `Filter`, but are generalized to all logical plan nodes. Thanks to scwf for bringing up these optimizations, and chenghao-intel for the generalization suggestion.
  Author: Cheng Lian <lian@databricks.com>
  Closes #3784 from liancheng/normalize-filters and squashes the following commits: caca560 [Cheng Lian] Moves filter normalization into BooleanSimplification rule 4ab3a58 [Cheng Lian] Fixes test failure, adds more tests 5d54349 [Cheng Lian] Fixes typo in comment 2abbf8e [Cheng Lian] Forgot our sacred Apache licence header... cf95639 [Cheng Lian] Adds an optimization rule for filter normalization
* [SPARK-4928][SQL] Fix: Operators '>,<,>=,<=' on decimals of different precision report an error (guowei2, 2014-12-30; 2 files, -0/+33)

  When an operator compares decimals of different precision, we need to change them to unlimited precision.
  Author: guowei2 <guowei2@asiainfo.com>
  Closes #3767 from guowei2/SPARK-4928 and squashes the following commits: c6a6e3e [guowei2] fix code style 3214e0a [guowei2] add test case b4985a2 [guowei2] fix code style 27adf42 [guowei2] Fix: Operation '>,<,>=,<=' with Decimal report error
* [SPARK-4930][SQL][DOCS] Update SQL programming guide: CACHE TABLE is eager (luogankun, 2014-12-30; 1 file, -5/+4)

  `CACHE TABLE tbl` is now __eager__ by default, not __lazy__.
  Author: luogankun <luogankun@gmail.com>
  Closes #3773 from luogankun/SPARK-4930 and squashes the following commits: cc17b7d [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, add CACHE [LAZY] TABLE [AS SELECT] ... bffe0e8 [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE tbl is eager
* [SPARK-4916][SQL][DOCS] Update SQL programming guide about the cache section (luogankun, 2014-12-30; 1 file, -4/+1)

  `SchemaRDD.cache()` now uses in-memory columnar storage.
  Author: luogankun <luogankun@gmail.com>
  Closes #3759 from luogankun/SPARK-4916 and squashes the following commits: 7b39864 [luogankun] [SPARK-4916]Update SQL programming guide 6018122 [luogankun] Merge branch 'master' of https://github.com/apache/spark into SPARK-4916 0b93785 [luogankun] [SPARK-4916]Update SQL programming guide 99b2336 [luogankun] [SPARK-4916]Update SQL programming guide