spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-15235][WEBUI] Corresponding row cannot be highlighted even though ↵	Kousuke Saruta	2016-05-10	1	-2/+2
	cursor is on the job on Web UI's timeline
	## What changes were proposed in this pull request?
	To extract job and stage identifiers from job descriptions and stage names, timeline-view.js uses the following regular expressions:
	```
	var jobIdText = $($(baseElem).find(".application-timeline-content")[0]).text();
	var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1];
	...
	var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text();
	var stageIdAndAttempt = stageIdText.match("\\(Stage (\\d+\\.\\d+)\\)")[1].split(".");
	```
	But if a job description includes a pattern like "(Job x)", or a stage name includes a pattern like "(Stage x.y)", the regular expressions do not match as expected, so the corresponding row is not highlighted even when the cursor is moved onto the job on the Web UI's timeline.
	## How was this patch tested?
	Manually tested with spark-shell and the Web UI.
	Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
	Closes #13016 from sarutak/SPARK-15235.
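
A minimal sketch of the mismatch described in this commit message, written in JavaScript like timeline-view.js. The label text and the end-anchored pattern are assumptions for illustration; the message does not quote the patched code, so anchoring with `$` is only one plausible remedy.

```javascript
// Hypothetical timeline label: a user-supplied description that itself
// contains "(Job 3)", followed by the "(Job <id>)" suffix shown in the UI.
const label = "retry of (Job 3) (Job 12)";

// Unanchored pattern, as quoted in the commit message: the first match wins,
// so text inside the description can shadow the real job id.
function jobIdUnanchored(text) {
  return text.match("\\(Job (\\d+)\\)")[1];
}

// Assumed remedy: anchor the pattern to the end of the label so that only
// the trailing "(Job <id>)" suffix can match.
function jobIdAnchored(text) {
  return text.match("\\(Job (\\d+)\\)$")[1];
}

console.log(jobIdUnanchored(label)); // "3"  (taken from the description)
console.log(jobIdAnchored(label));   // "12" (taken from the suffix)
```

The same idea would apply to the "(Stage x.y)" pattern; whether the suffix is always the last text in the element is an assumption about the page markup.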
*	[SPARK-15246][SPARK-4452][CORE] Fix code style and improve volatile for	Lianhui Wang	2016-05-10	2	-3/+2
	## What changes were proposed in this pull request?
	1. Fix code style.
	2. Remove volatile from the elementsRead method, because only one thread uses it.
	3. Avoid making _elementsRead volatile: the collection increments _elementsRead every time it inserts an element, so a volatile field there is very expensive.
	After this PR, I will push another PR for branch 1.6.
	## How was this patch tested?
	Unit tests.
	Author: Lianhui Wang <lianhuiwang09@gmail.com>
	Closes #13020 from lianhuiwang/SPARK-4452-hotfix.
*	[SPARK-12837][CORE] reduce network IO for accumulators	Wenchen Fan	2016-05-10	4	-11/+41
	Sending un-updated accumulators back to the driver makes no sense, as merging a zero-value accumulator is a no-op. We should only send back updated accumulators, to save network IO.
	New test in `TaskContextSuite`.
	Author: Wenchen Fan <wenchen@databricks.com>
	Closes #12899 from cloud-fan/acc.
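
The reasoning above (a zero-value update is a no-op on merge, so it need not be sent) can be sketched as follows. This is plain JavaScript for illustration only; the `{ name, value, zero }` shape and the function names are hypothetical and are not Spark's accumulator API.

```javascript
// Hypothetical accumulator shape: { name, value, zero }.
function isZero(acc) {
  return acc.value === acc.zero;
}

// Keep only the accumulators that were actually updated during the task.
function updatesToSend(accumulators) {
  return accumulators.filter((acc) => !isZero(acc));
}

// Driver-side merge for a sum accumulator: adding a zero value changes
// nothing, which is why skipping it is safe.
function merge(total, update) {
  return total + update.value;
}

const taskAccums = [
  { name: "recordsRead", value: 120, zero: 0 },
  { name: "bytesSpilled", value: 0, zero: 0 },
  { name: "errors", value: 0, zero: 0 },
];

const sent = updatesToSend(taskAccums);
console.log(sent.map((a) => a.name)); // [ 'recordsRead' ]
```

Merging only `sent` gives the same total as merging every entry, while two of the three updates never cross the network.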
*	[SPARK-11249][LAUNCHER] Throw error if app resource is not provided.	Marcelo Vanzin	2016-05-10	3	-8/+5
	Without this, the code would build an invalid spark-submit command line, and a more cryptic error would be presented to the user.
	Also, expose a constant that allows users to set a dummy resource in cases where they don't need an actual resource file; for backwards compatibility, that uses the same "spark-internal" resource that Spark itself uses.
	Tested via unit tests, run-example, spark-shell, and running the thrift server with mixed spark and hive command line arguments.
	Author: Marcelo Vanzin <vanzin@cloudera.com>
	Closes #12909 from vanzin/SPARK-11249.
*	[SPARK-14542][CORE] PipeRDD should allow configurable buffer size for…	Sital Kedia	2016-05-10	4	-27/+50
	## What changes were proposed in this pull request?
	Currently PipedRDD internally uses PrintWriter to write data to the stdin of the piped process, which by default uses a BufferedWriter with an 8k buffer. In our experiments, we have seen that an 8k buffer is too small and the job spends a significant amount of CPU time in system calls to copy the data. We should have a way to configure the buffer size for the writer.
	## How was this patch tested?
	Ran PipedRDDSuite tests.
	Author: Sital Kedia <skedia@fb.com>
	Closes #12309 from sitalkedia/bufferedPipedRDD.
*	[SPARK-15209] Fix display of job descriptions with single quotes in web UI ↵	Josh Rosen	2016-05-10	2	-8/+14
	timeline
	## What changes were proposed in this pull request?
	This patch fixes an escaping bug in the Web UI's event timeline that caused Javascript errors when displaying timeline entries whose descriptions include single quotes.
	The original bug can be reproduced by running
	```scala
	sc.setJobDescription("double quote: \" ")
	sc.parallelize(1 to 10).count()
	sc.setJobDescription("single quote: ' ")
	sc.parallelize(1 to 10).count()
	```
	and then browsing to the driver UI. Previously, this resulted in an "Uncaught SyntaxError" because the single quote from the description was not escaped and ended up closing a Javascript string literal too early.
	The fix implemented here is to change the relevant Javascript to define its string literals using double-quotes. Our escaping logic already properly escapes double quotes in the description, so this is safe to do.
	## How was this patch tested?
	Tested manually in `spark-shell` using the following cases:
	```scala
	sc.setJobDescription("double quote: \" ")
	sc.parallelize(1 to 10).count()
	sc.setJobDescription("single quote: ' ")
	sc.parallelize(1 to 10).count()
	sc.setJobDescription("ampersand: &")
	sc.parallelize(1 to 10).count()
	sc.setJobDescription("newline: \n text after newline ")
	sc.parallelize(1 to 10).count()
	sc.setJobDescription("carriage return: \r text after return ")
	sc.parallelize(1 to 10).count()
	```
	/cc sarutak for review.
	Author: Josh Rosen <joshrosen@databricks.com>
	Closes #12995 from JoshRosen/SPARK-15209.
*	[SPARK-10653][CORE] Remove unnecessary things from SparkEnv	Alex Bozarth	2016-05-09	4	-24/+8
	## What changes were proposed in this pull request?
	Removed blockTransferService and sparkFilesDir from SparkEnv since they're rarely used and don't need to be stored in the env. Edited their few usages to accommodate the change.
	## How was this patch tested?
	Ran dev/run-tests locally.
	Author: Alex Bozarth <ajbozart@us.ibm.com>
	Closes #12970 from ajbozarth/spark10653.
*	[SPARK-15220][UI] add hyperlink to running application and completed application	mwws	2016-05-09	1	-4/+4
	## What changes were proposed in this pull request?
	Add hyperlinks to "running application" and "completed application", so the user can jump to the application table directly. In my environment, I set up 1000+ workers and it's painful to scroll down past the worker list.
	## How was this patch tested?
	Manually tested.
	![screenshot](https://cloud.githubusercontent.com/assets/13216322/15105718/97e06768-15f6-11e6-809d-3574046751a9.png)
	Author: mwws <wei.mao@intel.com>
	Closes #12997 from mwws/SPARK_UI.
*	[SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments	Sandeep Singh	2016-05-07	1	-5/+0
	## What changes were proposed in this pull request?
	Remove the comment, since it no longer applies. See the discussion here (https://github.com/apache/spark/pull/12865#discussion-diff-61946906).
	Author: Sandeep Singh <sandeep@techaddict.me>
	Closes #12953 from techaddict/SPARK-15087-FOLLOW-UP.
*	[SPARK-1239] Improve fetching of map output statuses	Thomas Graves	2016-05-06	6	-83/+287
	The main issue we are trying to solve is the memory bloat of the Driver when tasks request the map output statuses. This means that with a large number of tasks you either need a huge amount of memory on the Driver or you have to repartition to a smaller number. This makes it really difficult to run over, say, 50000 tasks.
	The main issues that cause the memory bloat are:
	1) No flow control on sending the map output status responses. We serialize the map output statuses and then hand them off to netty to send. netty sends asynchronously and can't send them fast enough to keep up with incoming requests, so we end up with lots of copies of the serialized map output statuses sitting there. This causes huge bloat when you have 10's of thousands of tasks and the map output status is in the 10's of MB.
	2) When the initial reduce tasks are started up, they all request the map output statuses from the Driver. These requests are handled by multiple threads in parallel, so even though we check whether we have a cached version, initially, when we don't have one yet, many of the initial requests can all end up serializing the exact same map output statuses.
	This patch does a couple of things:
	- When the map output status size is over a threshold (default 512K), it uses broadcast to send the map statuses. This means we no longer serialize a large map output status into each response, and thus we don't have issues with memory bloat. The message sizes are now in the 300-400 byte range and the map output statuses are broadcast. If the size is under the threshold, it is sent as before; the message now contains the DIRECT indicator.
	- Synchronize the incoming requests to allow one thread to cache the serialized output and broadcast the map output status, which can then be used by everyone else. This ensures we don't create multiple broadcast variables when we don't need to. To ensure this happens I added a second thread pool which the Dispatcher hands the requests to, so that those threads can block without blocking the main dispatcher threads (which would cause things like heartbeats not to come through).
	Note that some of the design and code was contributed by mridulm.
	## How was this patch tested?
	Unit tests and a lot of manual testing. Ran with akka and netty rpc. Ran with both dynamic allocation on and off.
	One of the large jobs I used to test this was a join of 15TB of data. It had 200,000 map tasks and 20,000 reduce tasks. Executors ranged from 200 to 2000. This job ran successfully with 5GB of memory on the driver with these changes. Without these changes I was using 20GB and only had 500 reduce tasks. The job has 50MB of serialized map output statuses and took roughly the same amount of time for the executors to get the map output statuses as before.
	Ran a variety of other jobs, from large wordcounts to small ones not using broadcasts.
	Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
	Closes #12113 from tgravescs/SPARK-1239.
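
The two changes described in this message (a size threshold that switches between a direct reply and a broadcast, and serializing once so later requests reuse a cached reply) can be sketched roughly as below. This is illustrative JavaScript, not the Spark implementation: the indicator values, function names, and the `Map` cache are assumptions, and JavaScript's single thread stands in for the synchronization the patch adds.

```javascript
const DIRECT = 0;                   // hypothetical indicator values
const BROADCAST = 1;
const THRESHOLD_BYTES = 512 * 1024; // default quoted in the commit message

// Small payloads travel inside the reply; large ones are handed to a
// broadcast function and only a short handle is sent. How a payload of
// exactly THRESHOLD_BYTES is treated is an assumption here.
function buildReply(serializedStatuses, broadcastFn) {
  if (serializedStatuses.length > THRESHOLD_BYTES) {
    return { kind: BROADCAST, payload: broadcastFn(serializedStatuses) };
  }
  return { kind: DIRECT, payload: serializedStatuses };
}

// Serialize each shuffle's statuses at most once and reuse the cached reply.
function makeStatusServer(serialize, broadcastFn) {
  const cache = new Map(); // shuffleId -> reply
  let serializations = 0;
  return {
    replyFor(shuffleId) {
      if (!cache.has(shuffleId)) {
        serializations += 1;
        cache.set(shuffleId, buildReply(serialize(shuffleId), broadcastFn));
      }
      return cache.get(shuffleId);
    },
    serializationCount: () => serializations,
  };
}

// Usage: shuffle 1 serializes to 1 KiB, shuffle 2 to 1 MiB.
const server = makeStatusServer(
  (shuffleId) => new Uint8Array(shuffleId === 1 ? 1024 : 1024 * 1024),
  (bytes) => "broadcast-" + bytes.length
);
console.log(server.replyFor(1).kind);     // 0 (DIRECT)
console.log(server.replyFor(2).kind);     // 1 (BROADCAST)
server.replyFor(2);                       // served from the cache
console.log(server.serializationCount()); // 2
```

In a multi-threaded server the cache check and fill would need a lock or equivalent, which is the part the second bullet of the message addresses.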
*	[SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements	Jacek Laskowski	2016-05-05	4	-21/+14
	## What changes were proposed in this pull request?
	Minor doc and code style fixes.
	## How was this patch tested?
	Local build.
	Author: Jacek Laskowski <jacek@japila.pl>
	Closes #12928 from jaceklaskowski/SPARK-15152.
*	[SPARK-9926] Parallelize partition logic in UnionRDD.	Ryan Blue	2016-05-05	2	-1/+34
	This patch has the new logic from #8512 that uses a parallel collection to compute partitions in UnionRDD. The rest of #8512 added an alternative code path for calculating splits in S3, but that isn't necessary to get the same speedup. The underlying problem wasn't that bulk listing wasn't used, it was that an extra FileStatus was retrieved for each file. The fix was just committed as [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810). (I think the original commit also used a single prefix to enumerate all paths, but that isn't always helpful and it was removed in later versions so there is no need for SparkS3Utils.)
	I tested this using the same table that piapiaozhexiu was using. Calculating splits for a 10-day period took 25 seconds with this change and HADOOP-12810, which is on par with the results from #8512.
	Author: Ryan Blue <blue@apache.org>
	Author: Cheolsoo Park <cheolsoop@netflix.com>
	Closes #11242 from rdblue/SPARK-9926-parallelize-union-rdd.
*	[SPARK-15158][CORE] downgrade shouldRollover message to debug level	depend	2016-05-05	1	-1/+1
	## What changes were proposed in this pull request?
	Set the log level to debug when checking shouldRollover.
	## How was this patch tested?
	It was tested manually.
	Author: depend <depend@gmail.com>
	Closes #12931 from depend/master.
*	[SPARK-14915][CORE] Don't re-queue a task if another attempt has already ↵	Jason Moore	2016-05-05	1	-1/+10
	succeeded
	## What changes were proposed in this pull request?
	Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded.
	## How was this patch tested?
	I'm running a job which has enough skew in processing time across tasks for speculation to trigger in the last quarter (default settings), causing many commit denied exceptions to be thrown. Previously, these tasks were retried over and over again until the stage completed, wasting compute resources on superfluous tasks. With this change (applied to the 1.6 branch), they are no longer retried and the stage completes successfully without these extra task attempts.
	Author: Jason Moore <jasonmoore2k@outlook.com>
	Closes #12751 from jasonmoore2k/SPARK-14915.
*	[SPARK-12154] Upgrade to Jersey 2	mcheah	2016-05-05	3	-20/+25
	## What changes were proposed in this pull request?
	Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.
	## How was this patch tested?
	I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.
	Author: mcheah <mcheah@palantir.com>
	Closes #12715 from mccheah/feature/upgrade-jersey.
*	[SPARK-15045] [CORE] Remove dead code in ↵	Abhinav Gupta	2016-05-04	1	-6/+7
	TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
	## What changes were proposed in this pull request?
	Removed the dead code as suggested.
	Author: Abhinav Gupta <abhi.951990@gmail.com>
	Closes #12829 from abhi951990/master.
*	[SPARK-13001][CORE][MESOS] Prevent getting offers when reached max cores	Sebastien Rainville	2016-05-04	3	-17/+53
	Similar to https://github.com/apache/spark/pull/8639
	This change rejects offers for 120s once `spark.cores.max` has been reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, which causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.
	Author: Sebastien Rainville <sebastien@hopper.com>
	Closes #10924 from sebastienrainville/master.
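
A rough sketch of the offer-handling rule described above, in illustrative JavaScript. The `offer` and `state` shapes and the function name are hypothetical and do not mirror the Mesos or Spark scheduler APIs; only the 120-second refusal period comes from the message.

```javascript
const REJECT_SECONDS_AT_MAX_CORES = 120; // value quoted in the commit message

// Decide how to answer one resource offer. When the framework already holds
// its maximum number of cores, the offer is declined with a long refusal
// period so the resources go to other frameworks instead of bouncing back.
function handleOffer(offer, state) {
  if (state.acquiredCores >= state.maxCores) {
    return { accept: false, refuseSeconds: REJECT_SECONDS_AT_MAX_CORES };
  }
  const cores = Math.min(offer.cores, state.maxCores - state.acquiredCores);
  state.acquiredCores += cores;
  return { accept: true, cores };
}

// Usage: the framework may hold at most 8 cores and already has 6.
const state = { acquiredCores: 6, maxCores: 8 };
const first = handleOffer({ cores: 4 }, state);  // takes the remaining 2 cores
const second = handleOffer({ cores: 4 }, state); // at the cap: long refusal
console.log(first, second);
```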
*	[SPARK-12299][CORE] Remove history serving functionality from Master	Bryan Cutler	2016-05-04	10	-311/+86
	Remove history server functionality from standalone Master. Previously, the Master process rebuilt a SparkUI once the application was completed, which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270). Keeping this functionality out of the Master will help to simplify the process and increase stability.
	Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly. Also added 2 unit tests to verify that killing an application and driver from MasterWebUI makes the correct request to the Master.
	Author: Bryan Cutler <cutlerb@gmail.com>
	Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
*	[SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites	Reynold Xin	2016-05-04	3	-80/+3
	## What changes were proposed in this pull request?
	We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.
	Most of the changes are straightforward moves of code. On top of the code moving, I did:
	1. Use SparkSession instead of SQLContext.
	2. Turned most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.
	## How was this patch tested?
	This is a test only change.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12891 from rxin/SPARK-15115.
*	[SPARK-4224][CORE][YARN] Support group acls	Dhruve Ashar	2016-05-04	7	-25/+405
	## What changes were proposed in this pull request?
	Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.
	**Changes Proposed in the fix**
	Three new corresponding config entries have been added where the user can specify the groups to be given access.
	```
	spark.admin.acls.groups
	spark.modify.acls.groups
	spark.ui.view.acls.groups
	```
	New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.
	A generic trait has been introduced to provide the user to group mapping, which makes it pluggable to support a variety of mapping protocols, similar to the one used in hadoop. A default unix shell based implementation has been provided. A custom user to group mapping protocol can be specified and configured by the entry `spark.user.groups.mapping`.
	**How the patch was Tested**
	We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.
	Additional unit tests have been added without modifying the existing ones. These test different ways of setting the acls through configuration and/or API and validate the expected behavior.
	Author: Dhruve Ashar <dhruveashar@gmail.com>
	Closes #12760 from dhruve/impr/SPARK-4224.
*	[SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark	Reynold Xin	2016-05-03	2	-12/+11
	## What changes were proposed in this pull request?
	This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality.
	With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results.
	## How was this patch tested?
	N/A - this is a test util.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12884 from rxin/SPARK-15107.
*	[SPARK-14645][MESOS] Fix python running on cluster mode mesos to have non ↵	Timothy Chen	2016-05-03	1	-2/+3
	local uris
	## What changes were proposed in this pull request?
	Fix SparkSubmit to allow non-local python uris.
	## How was this patch tested?
	Manually tested with mesos-spark-dispatcher.
	Author: Timothy Chen <tnachen@gmail.com>
	Closes #12403 from tnachen/enable_remote_python.
*	[SPARK-15104] Fix spacing in log line	Andrew Ash	2016-05-03	1	-1/+1
	Otherwise we get logs that look like this (note no space before NODE_LOCAL):
	```
	INFO [2016-05-03 21:18:51,477] org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 (TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes)
	```
	Author: Andrew Ash <andrew@andrewash.com>
	Closes #12880 from ash211/patch-7.
*	[SPARK-11316] coalesce doesn't handle UnionRDD with partial locality properly	Thomas Graves	2016-05-03	2	-62/+165
	## What changes were proposed in this pull request?
	coalesce doesn't handle UnionRDD with partial locality properly. I had a user who had a UnionRDD that was made up of a mapPartitionRDD without preferred locations and a checkpointedRDD with preferred locations (getting from hdfs). It took the driver over 20 minutes to set up the groups and put the partitions into those groups before it even started any tasks. Perhaps even worse, it didn't end up with the number of partitions he was asking for, because it didn't put a partition in each of the groups properly.
	The changes in this patch get rid of an n^2 while loop that was causing the 20 minutes, properly distribute the partitions to have at least one per group, and change from using the rotation iterator, which got the preferred locations many times, to getting all the preferred locations once up front.
	Note that the n^2 while loop that I removed in setupGroups took so long because all of the partitions with preferred locations were already assigned to a group, so it basically looped through every single one and wasn't ever able to assign it. At the time I had 960 partitions with preferred locations and 1020 without, and did the outer while loop 319 times because that is the # of groups left to create. Note that each of those times through the inner while loop goes off to hdfs to get the block locations, so this is extremely inefficient.
	## How was this patch tested?
	Added unit tests for this case and ran the existing ones that applied to make sure there were no regressions. Also manually tested on the user's production job to make sure it fixed their issue. It created the proper number of partitions and now it takes about 6 seconds rather than 20 minutes. I also ran some basic manual tests with spark-shell, coalescing to a smaller number, the same number, and then a greater number with shuffle.
	Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
	Closes #11327 from tgravescs/SPARK-11316.
*	[SPARK-14234][CORE] Executor crashes for TaskRunner thread interruption	Devaraj K	2016-05-03	1	-1/+25
	## What changes were proposed in this pull request?
	Reset the task interruption status before updating the task status.
	## How was this patch tested?
	I verified it manually by running multiple applications: with the patch, the Executor doesn't crash and updates the status to the driver without any exceptions.
	Author: Devaraj K <devaraj@apache.org>
	Closes #12031 from devaraj-kavali/SPARK-14234.
*	[SPARK-15059][CORE] Remove fine-grained lock in ChildFirstURLClassLoader to ↵	Zheng Tan	2016-05-03	1	-26/+5
	avoid dead lock
	## What changes were proposed in this pull request?
	In some cases, the fine-grained lock races with the class-loader lock and has caused deadlocks. It is safe to drop this fine-grained lock and load all classes under the single class-loader lock.
	Author: Zheng Tan <zheng.tan@hulu.com>
	Closes #12857 from tankkyo/master.
*	[SPARK-15082][CORE] Improve unit test coverage for AccumulatorV2	Sandeep Singh	2016-05-03	1	-1/+60
	## What changes were proposed in this pull request?
	Added tests for ListAccumulator and LegacyAccumulatorWrapper. The test for ListAccumulator is similar to the one for the old Collection Accumulators.
	## How was this patch tested?
	Ran tests locally.
	cc rxin
	Author: Sandeep Singh <sandeep@techaddict.me>
	Closes #12862 from techaddict/SPARK-15082.
*	[SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value	Sandeep Singh	2016-05-03	7	-33/+25
	## What changes were proposed in this pull request?
	Remove AccumulatorV2.localValue and keep only value.
	## How was this patch tested?
	Existing tests.
	Author: Sandeep Singh <sandeep@techaddict.me>
	Closes #12865 from techaddict/SPARK-15087.
*	[SPARK-15081] Move AccumulatorV2 and subclasses into util package	Reynold Xin	2016-05-03	26	-30/+36
	## What changes were proposed in this pull request?
	This patch moves AccumulatorV2 and subclasses into the util package.
	## How was this patch tested?
	Updated relevant tests.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12863 from rxin/SPARK-15081.
*	[SPARK-6717][ML] Clear shuffle files after checkpointing in ALS	Holden Karau	2016-05-03	1	-1/+1
	## What changes were proposed in this pull request?
	When ALS is run with a checkpoint interval, materialize the current state during the checkpoint and clean up the previous shuffles (non-blocking).
	## How was this patch tested?
	Existing ALS unit tests, new ALS checkpoint cleanup unit tests added, and shuffle files checked after an ALS run with checkpointing.
	Author: Holden Karau <holden@us.ibm.com>
	Author: Holden Karau <holden@pigscanfly.ca>
	Closes #11919 from holdenk/SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS.
*	[SPARK-15079] Support average/count/sum in Long/DoubleAccumulator	Reynold Xin	2016-05-02	5	-101/+181
	## What changes were proposed in this pull request?
	This patch removes AverageAccumulator and adds the ability to compute average to LongAccumulator and DoubleAccumulator. The patch also improves documentation for the two accumulators.
	## How was this patch tested?
	Added unit tests for this.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12858 from rxin/SPARK-15079.
*	[SPARK-14685][CORE] Document heritability of localProperties	Marcin Tustin	2016-05-02	3	-2/+40
	## What changes were proposed in this pull request?
	This updates the java-/scala- doc for setLocalProperty to document heritability of localProperties. This also adds tests for that behaviour.
	## How was this patch tested?
	Tests pass. New tests were added.
	Author: Marcin Tustin <marcin.tustin@gmail.com>
	Closes #12455 from marcintustin/SPARK-14685.
*	[SPARK-15054] Deprecate old accumulator API	Reynold Xin	2016-05-02	3	-10/+20
	## What changes were proposed in this pull request?
	This patch deprecates the old accumulator API.
	## How was this patch tested?
	N/A
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12832 from rxin/SPARK-15054.
*	[SPARK-14845][SPARK_SUBMIT][YARN] spark.files in properties file is n…	Jeff Zhang	2016-05-02	1	-0/+1
	## What changes were proposed in this pull request?
	Initialize SparkSubmitArgument#files first from spark-submit arguments, then from the properties file, so that the sys property spark.yarn.dist.files will be set correctly.
	```
	OptionAssigner(args.files, YARN, ALL_DEPLOY_MODES, sysProp = "spark.yarn.dist.files"),
	```
	## How was this patch tested?
	Manual test. A file defined in the properties file is also distributed to the driver in yarn-cluster mode.
	Author: Jeff Zhang <zjffdu@apache.org>
	Closes #12656 from zjffdu/SPARK-14845.
*	[SPARK-15049] Rename NewAccumulator to AccumulatorV2	Reynold Xin	2016-05-01	22	-81/+82
	## What changes were proposed in this pull request?
	NewAccumulator isn't the best name if we ever come up with v3 of the API.
	## How was this patch tested?
	Updated tests to reflect the change.
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12827 from rxin/SPARK-15049.
*	[SPARK-14505][CORE] Fix bug : creating two SparkContext objects in the same ↵	Allen	2016-05-01	2	-15/+16
	jvm, the first one cannot run any task!
	After creating two SparkContext objects in the same JVM (the second one cannot be created successfully), using the first one to run a job throws an exception like the one below:
	![image](https://cloud.githubusercontent.com/assets/7162889/14402832/0c8da2a6-fe73-11e5-8aba-68ee3ddaf605.png)
	Author: Allen <yufan_1990@163.com>
	Closes #12273 from the-sea/context-create-bug.
*	[SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0	Herman van Hovell	2016-04-30	1	-9/+0
	#### What changes were proposed in this pull request?
	This PR removes three methods that were deprecated in 1.6.0:
	- `PortableDataStream.close()`
	- `LinearRegression.weights`
	- `LogisticRegression.weights`
	The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.
	#### How was this patch tested?
	Compilation succeeded.
	Author: Herman van Hovell <hvanhovell@questtec.nl>
	Closes #12732 from hvanhovell/SPARK-14952.
*	[SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs	Reynold Xin	2016-04-30	1	-1/+1
	## What changes were proposed in this pull request?
	This patch removes some code that is no longer relevant, mainly HiveSessionState.setDefaultOverrideConfs.
	## How was this patch tested?
	N/A
	Author: Reynold Xin <rxin@databricks.com>
	Closes #12806 from rxin/SPARK-15028.
*	[SPARK-15010][CORE] new accumulator should be tolerant of local RPC message ↵	Wenchen Fan	2016-04-29	1	-2/+7
	delivery
	## What changes were proposed in this pull request?
	The RPC framework does not serialize and deserialize messages in local mode, so we should not call `acc.value` when receiving a heartbeat message, because the serialization hook of the new accumulator may not be triggered and the `atDriverSide` flag may not be set.
	## How was this patch tested?
	Tested it locally via spark shell.
	Author: Wenchen Fan <wenchen@databricks.com>
	Closes #12795 from cloud-fan/bug.
*	[SPARK-15003] Use ConcurrentHashMap in place of HashMap for ↵	tedyu	2016-04-30	2	-13/+10
	NewAccumulator.originals
	## What changes were proposed in this pull request?
	This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals. This should result in better performance.
	## How was this patch tested?
	Existing unit test suite.
	cloud-fan
	Author: tedyu <yuzhihong@gmail.com>
	Closes #12776 from tedyu/master.
*	[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.	Sun Rui	2016-04-29	3	-5/+12
	## What changes were proposed in this pull request?
	dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.
	The function signature is:
	dapply(df, function(localDF) {}, schema = NULL)
	R function input: local data.frame from the partition on the local node
	R function output: local data.frame
	Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply().
	## How was this patch tested?
	SparkR unit tests.
	Author: Sun Rui <rui.sun@intel.com>
	Author: Sun Rui <sunrui2016@gmail.com>
	Closes #12493 from sun-rui/SPARK-12919.
*	[HOTFIX][CORE] fix a concurrence issue in NewAccumulator	Wenchen Fan	2016-04-28	3	-6/+12
	## What changes were proposed in this pull request?
	`AccumulatorContext` is not thread-safe, which is why all of its methods are synchronized. However, there is one exception: `AccumulatorContext.originals`. `NewAccumulator` uses it to check whether it is registered, which is wrong because that access is not synchronized.
	This PR marks `AccumulatorContext.originals` as `private`, so now all access to `AccumulatorContext` is synchronized.
	## How was this patch tested?
	I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone.
	Author: Wenchen Fan <wenchen@databricks.com>
	Closes #12773 from cloud-fan/debug.
*	Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵	Yin Huai	2016-04-28	4	-1/+79
\| \| \| \| \| \|	spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
*	[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵	Pravin Gadakh	2016-04-28	4	-79/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.
*	[SPARK-14935][CORE] DistributedSuite "local-cluster format" shouldn't ↵	Xin Ren	2016-04-28	1	-12/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	actually launch clusters https://issues.apache.org/jira/browse/SPARK-14935 In DistributedSuite, the "local-cluster format" test actually launches a bunch of clusters, but this doesn't seem necessary for what should just be a unit test of a regex. We should clean up the code so that this is testable without actually launching a cluster, which should buy us about 20 seconds per build. Passed unit test on my local machine Author: Xin Ren <iamshrek@126.com> Closes #12744 from keypointt/SPARK-14935.
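
A minimal sketch of what testing the "local-cluster format" as a plain regex check could look like, without launching any cluster. The pattern, the object name `LocalClusterRegexSketch`, and the helper `parse` are assumptions made for illustration; they are not taken from the patch.

```scala
object LocalClusterRegexSketch {
  // Assumed pattern for master strings such as "local-cluster[2, 1, 1024]":
  // three non-negative integers, with optional whitespace around each.
  val LocalCluster =
    """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*\]""".r

  // Hypothetical helper: returns the three numbers if the whole string matches.
  def parse(master: String): Option[(Int, Int, Int)] = master match {
    case LocalCluster(a, b, c) => Some((a.toInt, b.toInt, c.toInt))
    case _                     => None
  }
}
```

Because a Scala `Regex` extractor has to match the entire input, malformed strings such as `local-cluster[2,1]` fall through to `None`, so the format can be checked with ordinary assertions.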
*	[SPARK-14576][WEB UI] Spark console should display Web UI url	Ergin Seyfe	2016-04-28	2	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is a proposal to print the Spark Driver UI link when spark-shell is launched. ## How was this patch tested? Launched spark-shell in local mode and cluster mode. The spark-shell console output included the following line: "Spark context Web UI available at <Spark web url>" Author: Ergin Seyfe <eseyfe@fb.com> Closes #12341 from seyfe/spark_console_display_webui_link.
*	[SPARK-14654][CORE] New accumulator API	Wenchen Fan	2016-04-28	42	-587/+905
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR introduces a new accumulator API which is much simpler than before: 1. the type hierarchy is simplified, now we only have an `Accumulator` class 2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue` 3. there is only one `register` method, the accumulator registration and cleanup registration are combined. 4. the `id`, `name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration. `SQLMetric` is a good example to show the simplicity of this new API. What we break: 1. no `setValue` anymore. In the new API, the intermediate type can be different from the result type, so it's very hard to implement a general `setValue` 2. an accumulator can't be serialized before it is registered. Problems that need to be addressed in follow-ups: 1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value. 2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about sending a heartbeat to update internal metrics after the failure event? 3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` doesn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve SQL metrics. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #12612 from cloud-fan/acc.
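
The point about `setValue` can be illustrated with a small sketch. The trait `SketchAccumulator` and the class `AverageSketch` below are hypothetical and are not the API introduced by this PR. They only show an accumulator whose intermediate state (a sum and a count) differs from its result type (a single `Double`), which is the situation in which a general `setValue` has no obvious meaning.

```scala
// Hypothetical trait for illustration only; not Spark's actual accumulator class.
trait SketchAccumulator[IN, OUT] {
  def isZero: Boolean                              // true while nothing has been added
  def add(v: IN): Unit                             // task-side update
  def merge(other: SketchAccumulator[IN, OUT]): Unit // driver-side combine
  def value: OUT                                   // result derived from the state
}

// Intermediate state is (sum, count); the result is their ratio.
final class AverageSketch extends SketchAccumulator[Double, Double] {
  private var sum = 0.0
  private var count = 0L

  def isZero: Boolean = count == 0L

  def add(v: Double): Unit = { sum += v; count += 1 }

  def merge(other: SketchAccumulator[Double, Double]): Unit = other match {
    case o: AverageSketch =>
      sum += o.sum
      count += o.count
    case _ =>
      throw new UnsupportedOperationException("can only merge another AverageSketch")
  }

  // A result of 3.0 could come from many different (sum, count) pairs,
  // so the state cannot be reconstructed from the result alone.
  def value: Double = if (count == 0L) Double.NaN else sum / count
}
```

In this sketch a single zero state replaces separate initial and zero values, and merging an accumulator that is still zero leaves the target unchanged.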
*	[SPARK-10001][CORE] Don't short-circuit actions in signal handlers	Jakob Odersky	2016-04-27	1	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The current signal handlers have a subtle bug: they stop evaluating registered actions as soon as one of them returns true, because `forall` short-circuits. This PR adds a strict mapping stage before evaluating the returned results. There are no known occurrences of the bug and this is a preemptive fix. ## How was this patch tested? As with the original introduction of signal handlers, this was tested manually (unit testing with signals is not straightforward). Author: Jakob Odersky <jakob@odersky.com> Closes #12745 from jodersky/SPARK-10001-hotfix.
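
The short-circuit behaviour described above can be reproduced in isolation. This is a minimal sketch, not the patched code: the object `SignalActionsSketch`, the three actions, and the `escalate` flag are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object SignalActionsSketch {
  // Three hypothetical actions; each records its name in `log` when it runs.
  private def actions(log: ArrayBuffer[String]): Seq[() => Boolean] = Seq(
    () => { log += "first"; false },
    () => { log += "second"; true }, // a true result ends a short-circuited forall
    () => { log += "third"; false }
  )

  // Short-circuited form: `forall` stops at the first action that returns true,
  // so later actions never run.
  def shortCircuited(): (Boolean, List[String]) = {
    val log = ArrayBuffer.empty[String]
    val escalate = actions(log).forall(action => !action())
    (escalate, log.toList)
  }

  // Strict form: `map` on a strict Seq runs every action first,
  // and only then are the collected results inspected.
  def strict(): (Boolean, List[String]) = {
    val log = ArrayBuffer.empty[String]
    val results = actions(log).map(action => action())
    val escalate = results.forall(_ == false)
    (escalate, log.toList)
  }
}
```

Both forms compute the same Boolean; they differ only in which actions get to run, which matters when the actions have side effects.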
*	[SPARK-14966] SizeEstimator should ignore classes in the scala.reflect package	Josh Rosen	2016-04-27	1	-0/+3
\| \| \| \| \| \| \| \| \| \|	In local profiling, I noticed SizeEstimator spending tons of time estimating the size of objects which contain TypeTag or ClassTag fields. The problem with these tags is that they reference global Scala reflection objects, which, in turn, reference many singletons, such as TestHive. This throws off the accuracy of the size estimation and wastes tons of time traversing a huge object graph. As a result, I think that SizeEstimator should ignore any classes in the `scala.reflect` package. Author: Josh Rosen <joshrosen@databricks.com> Closes #12741 from JoshRosen/ignore-scala-reflect-in-size-estimator.
*	[SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use ↵	Hemant Bhanawat	2016-04-27	2	-67/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	newly added ExternalClusterManager ## What changes were proposed in this pull request? With the addition of the ExternalClusterManager (ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use it. This PR refactors the YARN code from the SparkContext.createTaskScheduler function into YarnClusterManager, which implements the ECM interface. ## How was this patch tested? Since this is a refactoring, no new tests have been added. Existing tests have been run. Basic manual testing with YARN was done too. Author: Hemant Bhanawat <hemant@snappydata.io> Closes #12641 from hbhanawat/yarnClusterMgr.