spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-10812] [YARN] Spark hadoop util support switching to yarn	Holden Karau	2015-09-28	2	-15/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While this is likely not a huge issue for real production systems, for test systems which may setup a Spark Context and tear it down and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode) tests this cane be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but seems like the logic that triggers it is present in master (see SparkHadoopUtil object). A valid work around for users encountering this issue is to fork a different JVM, however this can be heavy weight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest * ABORTED * [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau <holden@pigscanfly.ca> Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
*	[SPARK-9852] Let reduce tasks fetch multiple map output partitions	Matei Zaharia	2015-09-24	8	-124/+300
\| \| \| \| \| \| \| \| \| \| \| \| \|	This makes two changes: - Allow reduce tasks to fetch multiple map output partitions -- this is a pretty small change to HashShuffleFetcher - Move shuffle locality computation out of DAGScheduler and into ShuffledRDD / MapOutputTracker; this was needed because the code in DAGScheduler wouldn't work for RDDs that fetch multiple map output partitions from each reduce task I also added an AdaptiveSchedulingSuite that creates RDDs depending on multiple map output partitions. Author: Matei Zaharia <matei@databricks.com> Closes #8844 from mateiz/spark-9852.
*	[SPARK-10761] Refactor DiskBlockObjectWriter to not require BlockId	Josh Rosen	2015-09-24	8	-30/+27
\| \| \| \| \| \| \| \|	The DiskBlockObjectWriter constructor took a BlockId parameter but never used it. As part of some general cleanup in these interfaces, this patch refactors its constructor to eliminate this parameter. Author: Josh Rosen <joshrosen@databricks.com> Closes #8871 from JoshRosen/disk-block-object-writer-blockid-cleanup.
*	Revert "[SPARK-6028][Core]A new RPC implemetation based on the network module"	Xiangrui Meng	2015-09-24	26	-1699/+63
\| \| \| \|	This reverts commit 084e4e126211d74a79e8dbd2d0e604dd3c650822.
*	[SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array ↵	Andrew Or	2015-09-23	1	-39/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	(round 2) This patch reverts most of the changes in a previous fix #8827. The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now *we will simply not track the pointer array*. (Note that elsewhere we already don't track the pointer array, e.g. [here](https://github.com/apache/spark/blob/a18208047f06a4244703c17023bb20cbe1f59d73/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L88)) Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff. Author: Andrew Or <andrew@databricks.com> Closes #8888 from andrewor14/dont-track-pointer-array.
*	[SPARK-6028][Core]A new RPC implemetation based on the network module	zsxwing	2015-09-23	26	-63/+1699
\| \| \| \| \| \| \| \|	Design doc: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing Author: zsxwing <zsxwing@gmail.com> Closes #6457 from zsxwing/new-rpc.
*	[SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in ↵	Reynold Xin	2015-09-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Python DataFrame. Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take). This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion. Author: Reynold Xin <rxin@databricks.com> Closes #8876 from rxin/SPARK-10731.
*	[SPARK-10721] Log warning when file deletion fails	tedyu	2015-09-23	14	-25/+74
\| \| \| \| \| \|	Author: tedyu <yuzhihong@gmail.com> Closes #8843 from tedyu/master.
*	[SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for ↵	Tathagata Das	2015-09-22	4	-12/+137
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	all streaming jobs Here is the screenshot after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs. ## All jobs page * Added job descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png) ## All stages page * Added stage descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png) ## Streaming batch details page * Added the +details link ![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png) Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8791 from tdas/SPARK-10652.
*	[SPARK-10640] History server fails to parse TaskCommitDenied	Andrew Or	2015-09-22	3	-1/+35
\| \| \| \| \| \| \| \|	... simply because the code is missing! Author: Andrew Or <andrew@databricks.com> Closes #8828 from andrewor14/task-end-reason-json.
*	[SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling	Reynold Xin	2015-09-22	1	-11/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch refactors Python UDF handling: 1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterator as input/output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs. 2. Use PythonRunner in Spark SQL's BatchPythonEvaluation. 3. Updated BatchPythonEvaluation to only use its input once, rather than twice. This should fix Python UDF performance regression in Spark 1.5. There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small. This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution. Author: Reynold Xin <rxin@databricks.com> Closes #8835 from rxin/python-iter-refactor.
*	[SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader	Josh Rosen	2015-09-22	4	-10/+7
\| \| \| \| \| \| \| \|	The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader. Author: Josh Rosen <joshrosen@databricks.com> Closes #8825 from JoshRosen/shuffle-reader-cleanup.
*	[SPARK-9585] Delete the input format caching because some input format are ↵	xutingjun	2015-09-22	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	non thread safe If we cache the InputFormat, all tasks on the same executor will share it. Some InputFormat is thread safety, but some are not, such as HiveHBaseTableInputFormat. If tasks share a non thread safe InputFormat, unexpected error may be occurs. To avoid it, I think we should delete the input format caching. Author: xutingjun <xutingjun@huawei.com> Author: meiyoula <1039320815@qq.com> Author: Xutingjun <xutingjun@huawei.com> Closes #7918 from XuTingjun/cached_inputFormat.
*	[SPARK-10718] [BUILD] Update License on conf files and corresponding ↵	Rekha Joshi	2015-09-22	2	-0/+34
\| \| \| \| \| \| \| \| \| \| \|	excludes file update Update License on conf files and corresponding excludes file update Author: Rekha Joshi <rekhajoshm@gmail.com> Author: Joshi <rekhajoshm@gmail.com> Closes #8842 from rekhajoshm/SPARK-10718.
*	[Minor] style fix for previous commit f24316e	Andrew Or	2015-09-22	1	-0/+1
\|
*	[SPARK-10458] [SPARK CORE] Added isStopped() method in SparkContext	Madhusudanan Kandasamy	2015-09-22	1	-0/+4
\| \| \| \| \| \| \| \|	Added isStopped() method in SparkContext Author: Madhusudanan Kandasamy <madhusudanan@in.ibm.com> Closes #8749 from kmadhugit/SPARK-10458.
*	[SPARK-10711] [SPARKR] Do not assume spark.submit.deployMode is always set	Hossein	2015-09-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	In ```RUtils.sparkRPackagePath()``` we 1. Call ``` sys.props("spark.submit.deployMode")``` which returns null if ```spark.submit.deployMode``` is not suet 2. Call ``` sparkConf.get("spark.submit.deployMode")``` which throws ```NoSuchElementException``` if ```spark.submit.deployMode``` is not set. This patch simply passes a default value ("cluster") for ```spark.submit.deployMode```. cc rxin Author: Hossein <hossein@databricks.com> Closes #8832 from falaki/SPARK-10711.
*	[SPARK-10649] [STREAMING] Prevent inheriting job group and irrelevant job ↵	Tathagata Das	2015-09-21	2	-1/+82
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	description in streaming jobs The job group, and job descriptions information is passed through thread local properties, and get inherited by child threads. In case of spark streaming, the streaming jobs inherit these properties from the thread that called streamingContext.start(). This may not make sense. 1. Job group: This is mainly used for cancelling a group of jobs together. It does not make sense to cancel streaming jobs like this, as the effect will be unpredictable. And its not a valid usecase any way, to cancel a streaming context, call streamingContext.stop() 2. Job description: This is used to pass on nice text descriptions for jobs to show up in the UI. The job description of the thread that calls streamingContext.start() is not useful for all the streaming jobs, as it does not make sense for all of the streaming jobs to have the same description, and the description may or may not be related to streaming. The solution in this PR is meant for the Spark master branch, where local properties are inherited by cloning the properties. The job group and job description in the thread that starts the streaming scheduler are explicitly removed, so that all the subsequent child threads does not inherit them. Also, the starting is done in a new child thread, so that setting the job group and description for streaming, does not change those properties in the thread that called streamingContext.start(). Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8781 from tdas/SPARK-10649.
*	[SPARK-5259] [CORE] don't submit stage until its dependencies map outputs ↵	hushan[胡珊]	2015-09-21	4	-24/+191
\| \| \| \| \| \| \| \| \| \| \| \| \|	are registered Track pending tasks by partition ID instead of Task objects. Before this change, failure & retry could result in a case where a stage got submitted before the map output from its dependencies get registered. This was due to an error in the condition for registering map outputs. Author: hushan[胡珊] <hushan@xiaomi.com> Author: Imran Rashid <irashid@cloudera.com> Closes #7699 from squito/SPARK-5259.
*	[SPARK-7989] [SPARK-10651] [CORE] [TESTS] Increase timeout to fix flaky tests	zsxwing	2015-09-21	3	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I noticed only one block manager registered with master in an unsuccessful build (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/3534/) ``` 15/09/16 13:02:30.981 pool-1-thread-1-ScalaTest-running-BroadcastSuite INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT ... 15/09/16 13:02:38.133 sparkDriver-akka.actor.default-dispatcher-19 INFO BlockManagerMasterEndpoint: Registering block manager localhost:48196 with 530.3 MB RAM, BlockManagerId(0, localhost, 48196) ``` In addition, the first block manager needed 7+ seconds to start. But the test expected 2 block managers so it failed. However, there was no exception in this log file. So I checked a successful build (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3536/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/) and it needed 4-5 seconds to set up the local cluster: ``` 15/09/16 18:11:27.738 sparkWorker1-akka.actor.default-dispatcher-5 INFO Worker: Running Spark version 1.6.0-SNAPSHOT ... 15/09/16 18:11:30.838 sparkDriver-akka.actor.default-dispatcher-20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54202 with 530.3 MB RAM, BlockManagerId(1, localhost, 54202) 15/09/16 18:11:32.112 sparkDriver-akka.actor.default-dispatcher-20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:32955 with 530.3 MB RAM, BlockManagerId(0, localhost, 32955) ``` In this build, the first block manager needed only 3+ seconds to start. Comparing these two builds, I guess it's possible that the local cluster in `BroadcastSuite` cannot be ready in 10 seconds if the Jenkins worker is busy. So I just increased the timeout to 60 seconds to see if this can fix the issue. Author: zsxwing <zsxwing@gmail.com> Closes #8813 from zsxwing/fix-BroadcastSuite.
*	[SPARK-10710] Remove ability to disable spilling in core and SQL	Josh Rosen	2015-09-19	6	-94/+51
\| \| \| \| \| \| \| \| \| \|	It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`. This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling. Author: Josh Rosen <joshrosen@databricks.com> Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
*	[SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array	Andrew Or	2015-09-18	1	-2/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When `TungstenAggregation` hits memory pressure, it switches from hash-based to sort-based aggregation in-place. However, in the process we try to allocate the pointer array for writing to the new `UnsafeExternalSorter` before actually freeing the memory from the hash map. This lead to the following exception: ``` java.io.IOException: Could not acquire 65536 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126) at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435) ``` Author: Andrew Or <andrew@databricks.com> Closes #8827 from andrewor14/allocate-pointer-array.
*	[SPARK-10611] Clone Configuration for each task for NewHadoopRDD	Mingyu Kim	2015-09-18	2	-8/+34
\| \| \| \| \| \| \| \|	This patch attempts to fix the Hadoop Configuration thread safety issue for NewHadoopRDD in the same way SPARK-2546 fixed the issue for HadoopRDD. Author: Mingyu Kim <mkim@palantir.com> Closes #8763 from mingyukim/mkim/SPARK-10611.
*	[SPARK-9808] Remove hash shuffle file consolidation.	Reynold Xin	2015-09-18	4	-287/+13
\| \| \| \| \| \|	Author: Reynold Xin <rxin@databricks.com> Closes #8812 from rxin/SPARK-9808-1.
*	[SPARK-9522] [SQL] SparkSubmit process can not exit if kill application when ↵	linweizhong	2015-09-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	HiveThriftServer was starting When we start HiveThriftServer, we will start SparkContext first, then start HiveServer2, if we kill application while HiveServer2 is starting then SparkContext will stop successfully, but SparkSubmit process can not exit. Author: linweizhong <linweizhong@huawei.com> Closes #7853 from Sephiroth-Lin/SPARK-9522.
*	[SPARK-10531] [CORE] AppId is set as AppName in status rest api	Jeff Zhang	2015-09-17	5	-12/+13
\| \| \| \| \| \| \| \|	Verify it manually. Author: Jeff Zhang <zjffdu@apache.org> Closes #8688 from zjffdu/SPARK-10531.
*	[SPARK-10172] [CORE] disable sort in HistoryServer webUI	Josiah Samuel	2015-09-17	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \|	This pull request is to address the JIRA SPARK-10172 (History Server web UI gets messed up when sorting on any column). The content of the table gets messed up due to the rowspan attribute of the table data(cell) during sorting. The current table sort library used in SparkUI (sorttable.js) doesn't support/handle cells(td) with rowspans. The fix will disable the table sort in the web UI, when there are jobs listed with multiple attempts. Author: Josiah Samuel <josiah_sams@in.ibm.com> Closes #8506 from josiahsams/SPARK-10172.
*	[MINOR] [CORE] Fixes minor variable name typo	Cheng Lian	2015-09-17	1	-2/+2
\| \| \| \| \| \|	Author: Cheng Lian <lian@databricks.com> Closes #8784 from liancheng/typo-fix.
*	[SPARK-10050] [SPARKR] Support collecting data of MapType in DataFrame.	Sun Rui	2015-09-16	1	-0/+31
\| \| \| \| \| \| \| \| \|	1. Support collecting data of MapType from DataFrame. 2. Support data of MapType in createDataFrame. Author: Sun Rui <rui.sun@intel.com> Closes #8711 from sun-rui/SPARK-10050.
*	[SPARK-10589] [WEBUI] Add defense against external site framing	Sean Owen	2015-09-16	5	-11/+24
\| \| \| \| \| \| \| \|	Set `X-Frame-Options: SAMEORIGIN` to protect against frame-related vulnerability Author: Sean Owen <sowen@cloudera.com> Closes #8745 from srowen/SPARK-10589.
*	[SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in ↵	Josh Rosen	2015-09-15	13	-63/+135
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	OutputCommitCoordinator When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop. This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish). This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code. Author: Josh Rosen <joshrosen@databricks.com> Closes #8544 from JoshRosen/SPARK-10381.
*	[SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope	vinodkc	2015-09-15	1	-37/+31
\| \| \| \| \| \| \| \| \| \|	Remove return statements in RDD.takeSample and wrap it withScope Author: vinodkc <vinod.kc.in@gmail.com> Author: vinodkc <vinodkc@users.noreply.github.com> Author: Vinod K C <vinod.kc@huawei.com> Closes #8730 from vinodkc/fix_takesample_return.
*	[SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions	Andrew Or	2015-09-15	2	-43/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Note: this is for master branch only. The fix for branch-1.5 is at #8721. The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.: ``` (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() } ``` The cause is `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path. Author: Andrew Or <andrew@databricks.com> Closes #8710 from andrewor14/concurrent-sql-executions.
*	Revert "[SPARK-10300] [BUILD] [TESTS] Add support for test tags in ↵	Marcelo Vanzin	2015-09-15	1	-0/+10
\| \| \| \| \| \|	run-tests.py." This reverts commit 8abef21dac1a6538c4e4e0140323b83d804d602b.
*	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.	Marcelo Vanzin	2015-09-15	1	-10/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change does two things: - tag a few tests and adds the mechanism in the build to be able to disable those tags, both in maven and sbt, for both junit and scalatest suites. - add some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8437 from vanzin/test-tags.
*	Update version to 1.6.0-SNAPSHOT.	Reynold Xin	2015-09-15	2	-2/+2
\| \| \| \| \| \|	Author: Reynold Xin <rxin@databricks.com> Closes #8350 from rxin/1.6.
*	[SPARK-9851] Support submitting map stages individually in DAGScheduler	Matei Zaharia	2015-09-14	12	-63/+710
\| \| \| \| \| \| \| \| \| \|	This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing statistics about their output, as part of SPARK-9850. I also added more comments to many of the key classes in DAGScheduler. By itself, the patch is not super useful except maybe to switch between a shuffle and broadcast join, but with the other subtasks of SPARK-9850 we'll be able to do more interesting decisions. The main entry point is SparkContext.submitMapStage, which lets you run a map stage and see stats about the map output sizes. Other stats could also be collected through accumulators. See AdaptiveSchedulingSuite for a short example. Author: Matei Zaharia <matei@databricks.com> Closes #8180 from mateiz/spark-9851.
*	[SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the ↵	Andrew Or	2015-09-14	1	-8/+15
\| \| \| \| \| \| \| \| \| \|	test (round 2) This is a follow-up patch to #8723. I missed one case there. Author: Andrew Or <andrew@databricks.com> Closes #8727 from andrewor14/fix-threading-suite.
*	[SPARK-10543] [CORE] Peak Execution Memory Quantile should be Per-task Basis	Forest Fang	2015-09-14	2	-8/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Read `PEAK_EXECUTION_MEMORY` using `update` to get per task partial value instead of cumulative value. I tested with this workload: ```scala val size = 1000 val repetitions = 10 val data = sc.parallelize(1 to size, 5).map(x => (util.Random.nextInt(size / repetitions),util.Random.nextDouble)).toDF("key", "value") val res = data.toDF.groupBy("key").agg(sum("value")).count ``` Before: ![image](https://cloud.githubusercontent.com/assets/4317392/9828197/07dd6874-58b8-11e5-9bd9-6ba927c38b26.png) After: ![image](https://cloud.githubusercontent.com/assets/4317392/9828151/a5ddff30-58b7-11e5-8d31-eda5dc4eae79.png) Tasks view: ![image](https://cloud.githubusercontent.com/assets/4317392/9828199/17dc2b84-58b8-11e5-92a8-be89ce4d29d1.png) cc andrewor14 I appreciate if you can give feedback on this since I think you introduced display of this metric. Author: Forest Fang <forest.fang@outlook.com> Closes #8726 from saurfang/stagepage.
*	[SPARK-10576] [BUILD] Move .java files out of src/main/scala	Sean Owen	2015-09-14	4	-0/+0
\| \| \| \| \| \| \| \|	Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala) Author: Sean Owen <sowen@cloudera.com> Closes #8736 from srowen/SPARK-10576.
*	[SPARK-9899] [SQL] log warning for direct output committer with speculation ↵	Wenchen Fan	2015-09-14	1	-6/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	enabled This is a follow-up of https://github.com/apache/spark/pull/8317. When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path. However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details. Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8687 from cloud-fan/direct-committer.
*	[SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil ↵	Josh Rosen	2015-09-12	5	-9/+17
\| \| \| \| \| \| \| \| \| \|	JobContext methods This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #8521 from JoshRosen/SPARK-10330-part2.
*	[SPARK-10547] [TEST] Streamline / improve style of Java API tests	Sean Owen	2015-09-12	1	-227/+224
\| \| \| \| \| \| \| \|	Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order Author: Sean Owen <sowen@cloudera.com> Closes #8706 from srowen/SPARK-10547.
*	[SPARK-10554] [CORE] Fix NPE with ShutdownHook	Nithin Asokan	2015-09-12	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-10554 Fixes NPE when ShutdownHook tries to cleanup temporary folders Author: Nithin Asokan <Nithin.Asokan@Cerner.com> Closes #8720 from nasokan/SPARK-10554.
*	[SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks ↵	Daniel Imfeld	2015-09-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	important error information When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user. Manual testing shows the exception chained properly, and the test suite still looks fine as well. This contribution is my original work and I license the work to the project under the project's open source license. Author: Daniel Imfeld <daniel@danielimfeld.com> Closes #8725 from dimfeld/dimfeld-patch-1.
*	[SPARK-10564] ThreadingSuite: assertion failures in threads don't fail the test	Andrew Or	2015-09-11	1	-23/+45
\| \| \| \| \| \| \| \|	This commit ensures if an assertion fails within a thread, it will ultimately fail the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly. Author: Andrew Or <andrew@databricks.com> Closes #8723 from andrewor14/fix-threading-suite.
*	[SPARK-10546] Check partitionId's range in ExternalSorter#spill()	tedyu	2015-09-11	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	See this thread for background: http://search-hadoop.com/m/q3RTt0rWvIkHAE81 We should check the range of partition Id and provide meaningful message through exception. Alternatively, we can use abs() and modulo to force the partition Id into legitimate range. However, expectation is that user should correct the logic error in his / her code. Author: tedyu <yuzhihong@gmail.com> Closes #8703 from tedyu/master.
*	[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency	Matt Massie	2015-09-10	8	-22/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ShuffleManager implementations are currently not given type information for the key, value and combiner classes. Serialization of shuffle objects relies on objects being JavaSerializable, with methods defined for reading/writing the object or, alternatively, serialization via Kryo which uses reflection. Serialization systems like Avro, Thrift and Protobuf generate classes with zero argument constructors and explicit schema information (e.g. IndexedRecords in Avro have get, put and getSchema methods). By serializing the key, value and combiner class names in ShuffleDependency, shuffle implementations will have access to schema information when registerShuffle() is called. Author: Matt Massie <massie@cs.berkeley.edu> Closes #7403 from massie/shuffle-classtags.
*	[SPARK-10049] [SPARKR] Support collecting data of ArraryType in DataFrame.	Sun Rui	2015-09-10	2	-85/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	this PR : 1. Enhance reflection in RBackend. Automatically matching a Java array to Scala Seq when finding methods. Util functions like seq(), listToSeq() in R side can be removed, as they will conflict with the Serde logic that transferrs a Scala seq to R side. 2. Enhance the SerDe to support transferring a Scala seq to R side. Data of ArrayType in DataFrame after collection is observed to be of Scala Seq type. 3. Support ArrayType in createDataFrame(). Author: Sun Rui <rui.sun@intel.com> Closes #8458 from sun-rui/SPARK-10049.
*	[SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by ↵	Akash Mishra	2015-09-10	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	implementing the sufficientResourcesRegistered method spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos Coarse grained mode. If the parameter specified default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in base class and this method will always return true. There are no existing test for YARN mode too. Hence not added test for the same. Author: Akash Mishra <akash.mishra20@gmail.com> Closes #8672 from SleepyThread/master.