spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency	Matt Massie	2015-09-10	8	-22/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ShuffleManager implementations are currently not given type information for the key, value and combiner classes. Serialization of shuffle objects relies on objects being JavaSerializable, with methods defined for reading/writing the object or, alternatively, serialization via Kryo which uses reflection. Serialization systems like Avro, Thrift and Protobuf generate classes with zero argument constructors and explicit schema information (e.g. IndexedRecords in Avro have get, put and getSchema methods). By serializing the key, value and combiner class names in ShuffleDependency, shuffle implementations will have access to schema information when registerShuffle() is called. Author: Matt Massie <massie@cs.berkeley.edu> Closes #7403 from massie/shuffle-classtags.
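As a rough illustration of carrying class names alongside a shuffle dependency, here is a minimal sketch. `ShuffleTypeInfo` and `describe` are hypothetical names used only for this example and are not Spark's API; the only real mechanism relied on is Scala's `ClassTag`.

```scala
import scala.reflect.{ClassTag, classTag}

// Hypothetical holder for the class names a shuffle implementation could inspect.
final case class ShuffleTypeInfo(
    keyClassName: String,
    valueClassName: String,
    combinerClassName: Option[String])

object ShuffleTypeInfo {
  // Capture runtime class names from the ClassTags available at the call site.
  def describe[K: ClassTag, V: ClassTag, C: ClassTag]: ShuffleTypeInfo =
    ShuffleTypeInfo(
      classTag[K].runtimeClass.getName,
      classTag[V].runtimeClass.getName,
      Some(classTag[C].runtimeClass.getName))
}
```

Because only names are stored, the holder stays cheap to serialize, and a shuffle implementation could load the classes by name to reach their schema information.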
*	[SPARK-10049] [SPARKR] Support collecting data of ArrayType in DataFrame.	Sun Rui	2015-09-10	2	-85/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR: 1. Enhances reflection in RBackend by automatically matching a Java array to a Scala Seq when finding methods. Utility functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side. 2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection. 3. Supports ArrayType in createDataFrame(). Author: Sun Rui <rui.sun@intel.com> Closes #8458 from sun-rui/SPARK-10049.
*	[SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by ↵	Akash Mishra	2015-09-10	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	implementing the sufficientResourcesRegistered method The spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos coarse-grained mode. If the parameter is not specified, the default value of 0 is set for spark.scheduler.minRegisteredResourcesRatio in the base class and this method always returns true. There is no existing test for YARN mode either, so no test was added for this. Author: Akash Mishra <akash.mishra20@gmail.com> Closes #8672 from SleepyThread/master.
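The check described above can be sketched as follows. This is an assumption about the shape of the comparison, not the patch itself; `ResourceCheck` and its parameter names are hypothetical.

```scala
object ResourceCheck {
  // Hypothetical stand-in for a scheduler backend's sufficientResourcesRegistered:
  // scheduling may start once the acquired cores reach the configured fraction
  // of the maximum cores requested.
  def sufficientResourcesRegistered(
      totalCoresAcquired: Int,
      maxCores: Int,
      minRegisteredRatio: Double): Boolean =
    totalCoresAcquired >= maxCores * minRegisteredRatio
}
```

With a ratio of 0 the comparison holds for any core count, which matches the always-true behaviour mentioned in the message.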
*	[SPARK-6350] [MESOS] Fine-grained mode scheduler respects mesosExecutor.cores	Iulian Dragos	2015-09-10	2	-3/+33
\| \| \| \| \| \| \| \| \| \|	This is a regression introduced in #4960, this commit fixes it and adds a test. tnachen andrewor14 please review, this should be an easy one. Author: Iulian Dragos <jaguarul@gmail.com> Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
*	[SPARK-8167] Make tasks that fail from YARN preemption not fail job	mcheah	2015-09-10	15	-55/+186
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor is still removed from the driver's list of available executors, however. There are a few open questions: 1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well. 2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however. cc vanzin specifically for review because it collided with some earlier YARN scheduling work. cc JoshRosen because it's similar to output commit coordination we did in the past cc andrewor14 for our discussion on how to get executor exit codes and loss reasons Author: mcheah <mcheah@palantir.com> Closes #8007 from mccheah/feature/preemption-handling.
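The "called once per executor id" assumption from the second open question can be sketched like this. All names here (`LossReason`, `LossReasonRegistry`, `take`) are hypothetical and chosen for illustration; Spark's real types differ.

```scala
import scala.collection.mutable

// Hypothetical loss reasons for the sketch.
sealed trait LossReason
case object Preempted extends LossReason
final case class Exited(code: Int) extends LossReason

// Hypothetical registry: a reason is recorded when a container completes and
// handed out at most once, so the in-memory map does not accumulate entries.
final class LossReasonRegistry {
  private val reasons = mutable.HashMap.empty[String, LossReason]

  def record(executorId: String, reason: LossReason): Unit =
    reasons.update(executorId, reason)

  // Returns the reason and forgets it; a second call for the same id yields None.
  def take(executorId: String): Option[LossReason] = reasons.remove(executorId)

  // Tasks count as failed unless the executor was lost to preemption.
  def countsAsTaskFailure(reason: LossReason): Boolean = reason != Preempted
}
```

The trade-off is the one the message raises: removing on read bounds memory, but a caller that asks twice gets no answer the second time.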
*	[SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill	Cheng Hao	2015-09-10	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Data Spill with UnsafeRow causes assert failure.
```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```
To reproduce that with code (thanks andrewor14):
```scala
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```
Author: Cheng Hao <hao.cheng@intel.com> Closes #8635 from chenghao-intel/unsafe_spill.
*	[SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoin	Liang-Chi Hsieh	2015-09-09	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR is based on #8383, thanks to viirya. JIRA: https://issues.apache.org/jira/browse/SPARK-9730 This patch adds Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner. Closes #8383 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Davies Liu <davies@databricks.com> Closes #8579 from davies/smj_fullouter.
*	[SPARK-10227] fatal warnings with sbt on Scala 2.11	Luc Bourlier	2015-09-09	32	-86/+93
\| \| \| \| \| \| \| \| \| \| \|	The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary. But if the class parameters are used in methods, then fields are created, so it is safer to keep the annotations. The remainder are some potential bugs and deprecated syntax. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #8433 from skyluc/issue/sbt-2.11.
*	[SPARK-9767] Remove ConnectionManager.	Reynold Xin	2015-09-07	18	-3213/+15
\| \| \| \| \| \| \| \|	We introduced the Netty network module for shuffle in Spark 1.2, and have had it turned on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, ConnectionManager will have been off by default for a year. It's time to remove it. Author: Reynold Xin <rxin@databricks.com> Closes #8161 from rxin/SPARK-9767.
*	[SPARK-10454] [SPARK CORE] wait for empty event queue	robbins	2015-09-04	1	-0/+1
\| \| \| \| \| \|	Author: robbins <robbins@uk.ibm.com> Closes #8605 from robbinspg/DAGSchedulerSuite-fix.
*	[SPARK-9669] [MESOS] Support PySpark on Mesos cluster mode.	Timothy Chen	2015-09-04	2	-16/+39
\| \| \| \| \| \| \| \| \|	Support running pyspark with cluster mode on Mesos! This doesn't upload any scripts, so running against a remote Mesos cluster requires the user to specify the script from an available URI. Author: Timothy Chen <tnachen@gmail.com> Closes #8349 from tnachen/mesos_python.
*	[SPARK-10003] Improve readability of DAGScheduler	Andrew Or	2015-09-03	1	-37/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Note: this is not intended to be in Spark 1.5! This patch rewrites some code in the `DAGScheduler` to make it more readable. In particular - there were blocks of code that are unnecessary and removed for simplicity - there were abstractions that are unnecessary and made the code hard to navigate - other minor changes Author: Andrew Or <andrew@databricks.com> Closes #8217 from andrewor14/dag-scheduler-readability and squashes the following commits: 57abca3 [Andrew Or] Move comment back into if case 574fb1e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-scheduler-readability 64a9ed2 [Andrew Or] Remove unnecessary code + minor code rewrites
*	[SPARK-10421] [BUILD] Exclude curator artifacts from tachyon dependencies.	Marcelo Vanzin	2015-09-03	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \|	This avoids them being mistakenly pulled instead of the newer ones that Spark actually uses. Spark only depends on these artifacts transitively, so sometimes maven just decides to pick tachyon's version of the dependency for whatever reason. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8577 from vanzin/SPARK-10421.
*	[SPARK-10435] Spark submit should fail fast for Mesos cluster mode with R	Andrew Or	2015-09-03	1	-0/+3
\| \| \| \| \| \| \| \|	It's not supported yet so we should error with a clear message. Author: Andrew Or <andrew@databricks.com> Closes #8590 from andrewor14/mesos-cluster-r-guard.
*	[SPARK-9591] [CORE] Job may fail for exception during getting remote block	jeanlyn	2015-09-03	3	-2/+80
\| \| \| \| \| \| \| \| \|	[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591) When getting a broadcast variable, we can fetch the block from several locations. Currently, however, connecting to a lost BlockManager (for example, one removed by the driver after being idle long enough under dynamic resource allocation) causes the task to fail, and in the worst case causes the job to fail. Author: jeanlyn <jeanlyn92@gmail.com> Closes #7927 from jeanlyn/catch_exception.
*	[SPARK-10430] [CORE] Added hashCode methods in AccumulableInfo and ↵	Vinod K C	2015-09-03	4	-1/+26
\| \| \| \| \| \| \| \|	RDDOperationScope Author: Vinod K C <vinod.kc@huawei.com> Closes #8581 from vinodkc/fix_RDDOperationScope_Hashcode.
*	[SPARK-9672] [MESOS] Don’t include SPARK_ENV_LOADED when passing env vars	Pat Shields	2015-09-03	2	-4/+25
\| \| \| \| \| \| \| \|	This contribution is my original work and I license the work to the project under the project's open source license. Author: Pat Shields <yeoldefortran@gmail.com> Closes #7979 from pashields/env-loading-on-driver.
*	[SPARK-10431] [CORE] Fix intermittent test failure. Wait for event queue to ↵	robbins	2015-09-03	1	-0/+4
\| \| \| \| \| \| \| \|	be clear Author: robbins <robbins@uk.ibm.com> Closes #8582 from robbinspg/InputOutputMetricsSuite.
*	[SPARK-8951] [SPARKR] support Unicode characters in collect()	CHOIJAEHONG	2015-09-03	1	-5/+4
\| \| \| \| \| \| \| \| \|	Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters. I changed SerDe.scala so that Spark supports Unicode characters when it writes a string to R. Author: CHOIJAEHONG <redrock07@naver.com> Closes #7494 from CHOIJAEHONG1/SPARK-8951.
*	[SPARK-10379] preserve first page in UnsafeShuffleExternalSorter	Davies Liu	2015-09-02	3	-3/+8
\| \| \| \| \| \|	Author: Davies Liu <davies@databricks.com> Closes #8543 from davies/preserve_page.
*	[SPARK-10247] [CORE] improve readability of a test case in DAGSchedulerSuite	Imran Rashid	2015-09-02	1	-10/+47
\| \| \| \| \| \| \| \|	This is pretty minor, just trying to improve the readability of `DAGSchedulerSuite`, I figure every bit helps. Before whenever I read this test, I never knew what "should work" and "should be ignored" really meant -- this adds some asserts & updates comments to make it more clear. Also some reformatting per a suggestion from markhamstra on https://github.com/apache/spark/pull/7699 Author: Imran Rashid <irashid@cloudera.com> Closes #8434 from squito/SPARK-10247.
*	Removed code duplication in ShuffleBlockFetcherIterator	Evan Racah	2015-09-02	1	-8/+10
\| \| \| \| \| \| \| \|	Added fetchUpToMaxBytes() to prevent having to update both code blocks when a change is made. Author: Evan Racah <ejracah@gmail.com> Closes #8514 from eracah/master.
*	[SPARK-8707] RDD#toDebugString fails if any cached RDD has invalid partitions	navis.ryu	2015-09-02	2	-2/+6
\| \| \| \| \| \| \| \|	Added numPartitions(evaluate: Boolean) to RDD. With "evaluate=true" the method behaves the same as "partitions.length". With "evaluate=false", it checks checked-out or already evaluated partitions in the RDD to get the number of partitions. If neither case applies, it returns -1. RDDInfo.partitionNum calls numPartition only when it's accessed. Author: navis.ryu <navis@apache.org> Closes #7127 from navis/SPARK-8707.
*	[SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException	Ilya Ganelin	2015-09-02	3	-5/+320
\| \| \| \| \| \| \| \| \| \| \| \|	The ```Stage``` class now tracks whether there were a sufficient number of consecutive failures of that stage to trigger an abort. To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive stage failures for one stage. We still allow more than 4 consecutive stage failures if there is an intervening successful attempt for the stage, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have been recovered from successfully. I've added test cases to exercise the most obvious scenarios. Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #5636 from ilganeli/SPARK-5945.
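The abort rule described above can be sketched as a small counter. `StageFailureTracker` is a hypothetical name for illustration; only the limit of 4 and the reset-on-success behaviour come from the commit message.

```scala
// Hypothetical tracker for one stage; not the actual Stage class.
final class StageFailureTracker(maxConsecutiveFailures: Int = 4) {
  private var consecutiveFailures = 0

  // Record a failed attempt; returns true when the job should be aborted.
  def recordFailure(): Boolean = {
    consecutiveFailures += 1
    consecutiveFailures >= maxConsecutiveFailures
  }

  // A successful attempt ends the streak, so later failures start from zero.
  def recordSuccess(): Unit = {
    consecutiveFailures = 0
  }
}
```

Resetting on success is what lets a long-lived application reuse a stage many times without an old, recovered failure counting against it.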
*	[SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle data.	Marcelo Vanzin	2015-09-02	2	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To correctly isolate applications, when requests to read shuffle data arrive at the shuffle service, proper authorization checks need to be performed. This change makes sure that only the application that created the shuffle data can read from it. Such checks are only enabled when "spark.authenticate" is enabled, otherwise there's no secure way to make sure that the client is really who it says it is. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8218 from vanzin/SPARK-10004.
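A minimal sketch of the ownership check, assuming the service remembers which application registered each stream. `StreamRegistry`, `register` and `mayRead` are hypothetical names and not the shuffle service's API.

```scala
import scala.collection.mutable

// Hypothetical model of a shuffle service that tracks the owner of each stream.
final class StreamRegistry(authEnabled: Boolean) {
  private val owners = mutable.HashMap.empty[Long, String]

  def register(streamId: Long, appId: String): Unit =
    owners.update(streamId, appId)

  // Without authentication the client's identity cannot be trusted,
  // so the check is only enforced when authEnabled is true.
  def mayRead(streamId: Long, clientAppId: String): Boolean =
    !authEnabled || owners.get(streamId).contains(clientAppId)
}
```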
*	[SPARK-7336] [HISTORYSERVER] Fix bug that applications status incorrect on ↵	Chuan Shao	2015-09-02	1	-5/+22
\| \| \| \| \| \| \| \|	JobHistory UI. Author: ArcherShao <shaochuan@huawei.com> Closes #5886 from ArcherShao/SPARK-7336.
*	[SPARK-4223] [CORE] Support * in acls.	zhuol	2015-09-01	2	-4/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SPARK-4223. Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access. Manual tests to verify that: "*" works for any user in: a. Spark ui: view and kill stage. Done. b. Spark history server. Done. c. Yarn application killing. Done. Author: zhuol <zhuol@yahoo-inc.com> Closes #8398 from zhuoliu/4223.
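The wildcard behaviour can be sketched as a simple membership test. `Acls`, `parse` and `isAllowed` are hypothetical helpers for illustration, and the comma-separated input format is an assumption.

```scala
object Acls {
  val Wildcard = "*"

  // Parse a comma-separated ACL string into a set of trimmed, non-empty names.
  def parse(acl: String): Set[String] =
    acl.split(',').map(_.trim).filter(_.nonEmpty).toSet

  // A user is allowed if listed explicitly or if the ACL contains the wildcard.
  def isAllowed(user: String, acl: Set[String]): Boolean =
    acl.contains(Wildcard) || acl.contains(user)
}
```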
*	[SPARK-10341] [SQL] fix memory starving in unsafe SMJ	Davies Liu	2015-08-31	3	-6/+42
\| \| \| \| \| \| \| \| \| \| \| \|	In SMJ, the first ExternalSorter could consume all the memory before spilling, so the second cannot even acquire its first page. Until we have a better memory allocator, SMJ should call prepare() before calling any compute() of its children. cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8511 from davies/smj_memory.
*	[SPARK-8730] Fixes - Deser objects containing a primitive class attribute	EugenCepoi	2015-08-31	2	-5/+40
\| \| \| \| \| \|	Author: EugenCepoi <cepoi.eugen@gmail.com> Closes #7122 from EugenCepoi/master.
*	[SPARK-10184] [CORE] Optimization for bounds determination in RangePartitioner	ihainan	2015-08-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	JIRA Issue: https://issues.apache.org/jira/browse/SPARK-10184 Change `cumWeight > target` to `cumWeight >= target` in `RangePartitioner.determineBounds` method to make the output partitions more balanced. Author: ihainan <ihainan72@gmail.com> Closes #8397 from ihainan/opt_for_rangepartitioner.
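To see why the comparison matters, here is a simplified stand-in for the bounds computation. It is a sketch, not `RangePartitioner.determineBounds` itself: it uses `Int` keys, omits duplicate-key handling, and the `inclusive` flag is added only to compare the two comparisons side by side.

```scala
import scala.collection.mutable.ArrayBuffer

object BoundsSketch {
  // inclusive = true uses `cumWeight >= target`; false uses the older `cumWeight > target`.
  def determineBounds(
      candidates: Seq[(Int, Double)],
      partitions: Int,
      inclusive: Boolean): Seq[Int] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[Int]
    for ((key, weight) <- ordered) {
      // Only partitions - 1 bounds are needed to split into `partitions` ranges.
      if (bounds.length < partitions - 1) {
        cumWeight += weight
        val reached = if (inclusive) cumWeight >= target else cumWeight > target
        if (reached) {
          bounds += key
          target += step
        }
      }
    }
    bounds.toList
  }
}
```

For four equally weighted keys 1 to 4 and two partitions, the inclusive comparison picks key 2 as the bound while the strict one picks key 3. Assuming keys up to and including a bound fall in the lower partition, that is a 2/2 split instead of 3/1.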
*	[SPARK-9284] [TESTS] Allow all tests to run without an assembly.	Marcelo Vanzin	2015-08-28	2	-2/+152
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change aims at speeding up the dev cycle a little bit, by making sure that all tests behave the same w.r.t. where the code to be tested is loaded from. Namely, that means that tests don't rely on the assembly anymore, rather loading all needed classes from the build directories. The main change is to make sure all build directories (classes and test-classes) are added to the classpath of child processes when running tests. YarnClusterSuite required some custom code since the executors are run differently (i.e. not through the launcher library, like standalone and Mesos do). I also found a couple of tests that could leak a SparkContext on failure, and added code to handle those. With this patch, it's possible to run the following command from a clean source directory and have all tests pass: mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7629 from vanzin/SPARK-9284.
*	[SPARK-10295] [CORE] Dynamic allocation in Mesos does not release when RDDs ↵	Sean Owen	2015-08-28	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \|	are cached Remove obsolete warning about dynamic allocation not working with cached RDDs See discussion in https://issues.apache.org/jira/browse/SPARK-10295 Author: Sean Owen <sowen@cloudera.com> Closes #8489 from srowen/SPARK-10295.
*	[SPARK-10251] [CORE] some common types are not registered for Kryo Serializat…	Ram Sriharsha	2015-08-26	2	-1/+64
\| \| \| \| \| \| \| \|	…ion by default Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #8465 from harsha2010/SPARK-10251.
*	[SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.	Sun Rui	2015-08-25	2	-35/+58
\| \| \| \| \| \| \| \| \| \| \|	This PR: 1. supports transferring arbitrary nested array from JVM to R side in SerDe; 2. based on 1, collect() implementation is improved. Now it can support collecting data of complex types from a DataFrame. Author: Sun Rui <rui.sun@intel.com> Closes #8276 from sun-rui/SPARK-10048.
*	[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses ↵	Sean Owen	2015-08-25	74	-377/+342
\| \| \| \| \| \| \| \| \| \| \| \|	to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
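For readers unfamiliar with the difference, a small sketch of the explicit style follows. `ConvertersSketch` and `lengths` are hypothetical names for illustration. Note that on Scala 2.13 and later `scala.collection.JavaConverters` is itself deprecated in favour of `scala.jdk.CollectionConverters`, so this reflects the Scala versions of the time.

```scala
import scala.collection.JavaConverters._

object ConvertersSketch {
  // With JavaConverters the conversion is spelled out via asScala / asJava,
  // whereas JavaConversions applied the same conversions implicitly.
  def lengths(names: java.util.List[String]): List[Int] =
    names.asScala.map(_.length).toList
}
```

Making the conversion visible at the call site is the point: a reader can tell where a Java collection is being wrapped.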
*	Fixed a typo in DAGScheduler.	ehnalis	2015-08-25	1	-7/+20
\| \| \| \| \| \|	Author: ehnalis <zoltan.zvara@gmail.com> Closes #8308 from ehnalis/master.
*	[DOC] add missing parameters in SparkContext.scala for scala doc	Zhang, Liye	2015-08-25	1	-1/+14
\| \| \| \| \| \|	Author: Zhang, Liye <liye.zhang@intel.com> Closes #8412 from liyezhang556520/minorDoc.
*	[SPARK-10144] [UI] Actually show peak execution memory by default	Andrew Or	2015-08-24	2	-6/+8
\| \| \| \| \| \| \| \|	The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default. Author: Andrew Or <andrew@databricks.com> Closes #8345 from andrewor14/show-memory-default.
*	[SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb	Imran Rashid	2015-08-21	4	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-9439 In general, Yarn apps should be robust to NodeManager restarts. However, if you run spark with the external shuffle service on, after a NM restart all shuffles fail, b/c the shuffle service has lost some state with info on each executor. (Note the shuffle data is perfectly fine on disk across a NM restart, the problem is we've lost the small bit of state that lets us find those files.) The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added. When running with yarn, that file is in the NM's local dir. Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there. Nothing is changed in non-yarn modes with this patch. The service is not given a place to save the state to, so it operates the same as before. This should make it easy to update other cluster managers as well, by just supplying the right file & the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that. 
Author: Imran Rashid <irashid@cloudera.com> Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits: 0d285d3 [Imran Rashid] review feedback 70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 5c71c8c [Imran Rashid] save executor to db before registering; style 2499c8c [Imran Rashid] explicit dependency on jackson-annotations 795d28f [Imran Rashid] review feedback 81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 594d520 [Imran Rashid] use json to serialize application executor info 1a7980b [Imran Rashid] version 8267d2a [Imran Rashid] style e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little 9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files acedb62 [Imran Rashid] switch to writing out one record per executor 79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup 12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests c878fbe [Imran Rashid] better explanation of shuffle service port handling 694934c [Imran Rashid] only open leveldb connection once per service d596410 [Imran Rashid] store executor data in leveldb 59800b7 [Imran Rashid] Files.move in case renaming is unsupported 32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart d7450f0 [Imran Rashid] style f729e2b [Imran Rashid] debugging 4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks 0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart 55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests 245db19 [Imran Rashid] style 62586a6 [Imran Rashid] just serialize the whole executors map bdbbf0d [Imran Rashid] comments, remove some unnecessary changes 857331a [Imran Rashid] better tests & comments bb9d1e6 [Imran Rashid] formatting bdc4b32 [Imran Rashid] rename
86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port 23994ff [Imran Rashid] style 7504de8 [Imran Rashid] style a36729c [Imran Rashid] cleanup efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts d596969 [Imran Rashid] cleanup imports 0e9d69b [Imran Rashid] better names 9eae119 [Imran Rashid] cleanup lots of duplication 1136f44 [Imran Rashid] test needs to have an actual shuffle 0b588bd [Imran Rashid] more fixes ... ad122ef [Imran Rashid] more fixes 5e5a7c3 [Imran Rashid] fix build c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ... bb3ba49 [Imran Rashid] minor cleanup 36127d3 [Imran Rashid] wip b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests
*	[SPARK-10140] [DOC] add target fields to @Since	Xiangrui Meng	2015-08-20	1	-0/+2
\| \| \| \| \| \| \| \|	so constructors parameters and public fields can be annotated. rxin MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #8344 from mengxr/SPARK-10140.2.
*	[SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type	Alex Shkurenko	2015-08-20	1	-0/+5
\| \| \| \| \| \|	Author: Alex Shkurenko <ashkurenko@enova.com> Closes #8239 from ashkurenko/master.
*	[SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode.	Timothy Chen	2015-08-19	1	-8/+11
\| \| \| \| \| \| \| \| \| \|	Currently Spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we don't handle removing jobs from the buffer correctly while iterating, which causes a null pointer exception. This patch copies the buffer before iterating over it, so exceptions aren't thrown when jobs are removed. Author: Timothy Chen <tnachen@gmail.com> Closes #8322 from tnachen/fix_cluster_mode.
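The copy-before-iterating pattern can be sketched as follows. `QueueSketch`, `launchAll` and the `String` driver ids are hypothetical simplifications, not the dispatcher's actual code.

```scala
import scala.collection.mutable.ArrayBuffer

object QueueSketch {
  // Launch every queued driver that fits and remove it from the queue.
  // Iterating over a snapshot (toList) keeps removals from disturbing the traversal.
  def launchAll(queue: ArrayBuffer[String], canLaunch: String => Boolean): List[String] = {
    val launched = List.newBuilder[String]
    for (driver <- queue.toList) {
      if (canLaunch(driver)) {
        queue -= driver
        launched += driver
      }
    }
    launched.result()
  }
}
```

The snapshot costs one extra copy per pass, which is usually negligible next to the risk of mutating a buffer that is being traversed.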
*	[SPARK-10119] [CORE] Fix isDynamicAllocationEnabled when config is ↵	Marcelo Vanzin	2015-08-19	2	-1/+15
\| \| \| \| \| \| \| \|	explicitly disabled. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8316 from vanzin/SPARK-10119.
*	[SPARK-8889] [CORE] Fix for OOM for graph creation	Joshi	2015-08-19	2	-11/+51
\| \| \| \| \| \| \| \| \|	Fix for OOM for graph creation Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #7602 from rekhajoshm/SPARK-8889.
*	[SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are ↵	Yu ISHIKAWA	2015-08-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	complicated I added lots of Column functions to SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type. ### JIRA [[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8264 from yu-iskw/SPARK-9856-3.
*	[SPARK-8949] Print warnings when using preferred locations feature	Han JU	2015-08-19	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Add warnings according to SPARK-8949 in `SparkContext` - warnings in scaladoc - log warnings when preferred locations feature is used through `SparkContext`'s constructor However, I didn't find any documentation reference for this feature. Please direct me if you know of one. Author: Han JU <ju.han.felix@gmail.com> Closes #7874 from darkjh/SPARK-8949.
*	[SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf ↵	Tathagata Das	2015-08-18	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations Small changes - Renamed conf spark.streaming.backpressure.{enable --> enabled} - Change Java Deprecated annotations to Scala deprecated annotation with more information. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8299 from tdas/SPARK-9967.
*	[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal ↵	Josh Rosen	2015-08-18	4	-16/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	accesses cacheLocs In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine): ```scala val numItems = 100000 val s = Seq.fill(numItems)(1) for (i <- 0 until numItems) s(i) ``` It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput. This patch fixes this by replacing `Seq` with `Array`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8178 from JoshRosen/dagscheduler-perf.
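The observation about `Seq.fill` can be checked with a small sketch. `IndexingSketch` is a hypothetical name, the size of 1000 is arbitrary, and the claim that the default `Seq` is a `List` holds for the Scala versions I am aware of rather than being guaranteed by the `Seq` contract.

```scala
object IndexingSketch {
  private val size = 1000

  // Seq.fill builds the default Seq implementation, which is a linked List.
  val asList: Seq[Int] = Seq.fill(size)(1)

  // Array.fill gives constant-time access by index.
  val asArray: Array[Int] = Array.fill(size)(1)

  // Each asList(i) walks i nodes, so the whole loop is quadratic in size.
  def sumList(): Int = (0 until size).map(i => asList(i)).sum

  // Each asArray(i) is a direct lookup, so the whole loop is linear in size.
  def sumArray(): Int = (0 until size).map(i => asArray(i)).sum
}
```

Both loops compute the same result; only the cost of each indexed access differs, which is why swapping the collection type is enough to remove the quadratic behaviour.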
*	[SPARK-7736] [CORE] Fix a race introduced in PythonRunner.	Marcelo Vanzin	2015-08-18	1	-1/+7
\| \| \| \| \| \| \| \| \| \|	The fix for SPARK-7736 introduced a race where a port value of "-1" could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8258 from vanzin/SPARK-7736.
*	[MINOR] fix the comments in IndexShuffleBlockResolver	CodingCat	2015-08-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	it might be a typo introduced at the first moment or some leftover after some renaming...... the name of the method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments) Author: CodingCat <zhunansjtu@gmail.com> Closes #8238 from CodingCat/minor_1.