spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[SPARK-14449][SQL] SparkContext should use SparkListenerInterface	Michael Armbrust	2016-04-07	2	-5/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`. This changes the addListener function and the config based injection to use the interface instead. The existing tests in SparkListenerSuite are improved such that they would have caught this. Follow-up to #12142 Author: Michael Armbrust <michael@databricks.com> Closes #12227 from marmbrus/fixListener.
*	[SPARK-14468] Always enable OutputCommitCoordinator	Andrew Or	2016-04-07	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468). Before: `OutputCommitCoordinator` is enabled only if speculation is enabled. After: `OutputCommitCoordinator` is always enabled. Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't... ## How was this patch tested? `OutputCommitCoordinator*Suite` Author: Andrew Or <andrew@databricks.com> Closes #12244 from andrewor14/always-occ.
*	[SPARK-12384] Enables spark-clients to set the min(-Xms) and max(*.memory ↵	Dhruve Ashar	2016-04-07	2	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	config) j… ## What changes were proposed in this pull request? Currently Spark clients are started with the same memory setting for Xms and Xms leading to reserving unnecessary higher amounts of memory. This behavior is changed and the clients can now specify an initial heap size using the extraJavaOptions in the config for driver,executor and am individually. Note, that only -Xms can be provided through this config option, if the client wants to set the max size(-Xmx), this has to be done via the *.memory configuration knobs which are currently supported. ## How was this patch tested? Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting up -Xmx parameter in the javaExtraOptions raises exception with the info provided. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #12115 from dhruve/impr/SPARK-12384.
*	[SPARK-14245][WEB UI] Display the user in the application view	Alex Bozarth	2016-04-07	3	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The Spark UI (both active and history) should show the user who ran the application somewhere when you are in the application view. This was added under the Jobs view by total uptime and scheduler mode. ## How was this patch tested? Manual testing <img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png"> Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #12123 from ajbozarth/spark14245.
*	[SPARK-13112][CORE] Make sure RegisterExecutorResponse arrive before LaunchTask	Shixiong Zhu	2016-04-06	2	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Send `RegisterExecutorResponse` using `executorRef` in order to make sure RegisterExecutorResponse and LaunchTask are both sent using the same channel. Then RegisterExecutorResponse will always arrive before LaunchTask ## How was this patch tested? Existing unit tests Closes #12078 Author: Shixiong Zhu <shixiong@databricks.com> Closes #12211 from zsxwing/SPARK-13112.
*	[SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ↵	Dongjoon Hyun	2016-04-06	3	-10/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ScalaDoc-style multiline comments ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings. ``` /** In Spark, we don't use the ScalaDoc style so this * is not correct. */ ``` ## How was this patch tested? Pass the Jenkins tests (including `lint-scala`). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12221 from dongjoon-hyun/SPARK-14444.
*	[SPARK-12133][STREAMING] Streaming dynamic allocation	Tathagata Das	2016-04-06	3	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation. ## How was this patch tested Unit tests, and cluster tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12154 from tdas/streaming-dynamic-allocation.
*	[SPARK-14252] Executors do not try to download remote cached blocks	Eric Liang	2016-04-05	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As mentioned in the ticket this was because one get path in the refactored `BlockManager` did not check for remote storage. ## How was this patch tested? Unit test, also verified manually with reproduction in the ticket. cc JoshRosen Author: Eric Liang <ekl@databricks.com> Closes #12193 from ericl/spark-14252.
*	[SPARK-14416][CORE] Add thread-safe comments for ↵	Shixiong Zhu	2016-04-05	1	-13/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CoarseGrainedSchedulerBackend's fields ## What changes were proposed in this pull request? While I was reviewing #12078, I found most of CoarseGrainedSchedulerBackend's mutable fields doesn't have any comments about the thread-safe assumptions and it's hard for people to figure out which part of codes should be protected by the lock. This PR just added comments/annotations for them and also added strict access modifiers for some fields. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12188 from zsxwing/comments.
*	[SPARK-529][SQL] Modify SQLConf to use new config API from core.	Marcelo Vanzin	2016-04-05	2	-33/+55
\| \| \| \| \| \| \| \| \| \| \| \|	Because SQL keeps track of all known configs, some customization was needed in SQLConf to allow that, since the core API does not have that feature. Tested via existing (and slightly updated) unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11570 from vanzin/SPARK-529-sql.
*	[SPARK-14397][WEBUI] <html> and <body> tags are nested in LogPage	Kousuke Saruta	2016-04-05	1	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In `LogPage`, the content to be rendered is defined as follows. ``` val content = <html> <body> {linkToMaster} <div> <div style="float:left; margin-right:10px">{backButton}</div> <div style="float:left;">{range}</div> <div style="float:right; margin-left:10px">{nextButton}</div> </div> <br /> <div style="height:500px; overflow:auto; padding:5px;"> <pre>{logText}</pre> </div> </body> </html> UIUtils.basicSparkPage(content, logType + " log page for " + pageName) ``` As you can see, <html> and <body> tags will be rendered. On the other hand, `UIUtils.basicSparkPage` will render those tags so those tags will be nested. ``` def basicSparkPage( content: => Seq[Node], title: String, useDataTables: Boolean = false): Seq[Node] = { <html> <head> {commonHeaderNodes} {if (useDataTables) dataTablesHeaderNodes else Seq.empty} <title>{title}</title> </head> <body> <div class="container-fluid"> <div class="row-fluid"> <div class="span12"> <h3 style="vertical-align: middle; display: inline-block;"> <a style="text-decoration: none" href={prependBaseUri("/")}> <img src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} /> <span class="version" style="margin-right: 15px;">{org.apache.spark.SPARK_VERSION}</span> </a> {title} </h3> </div> </div> {content} </div> </body> </html> } ``` These are the screen shots before this patch is applied. ![before1](https://cloud.githubusercontent.com/assets/4736016/14273236/03cbed8a-fb44-11e5-8786-bc1bfa4d3f8c.png) ![before2](https://cloud.githubusercontent.com/assets/4736016/14273237/03d1741c-fb44-11e5-9dee-ea93022033a6.png) And these are the ones after this patch is applied. ![after1](https://cloud.githubusercontent.com/assets/4736016/14273248/1b6a7d8a-fb44-11e5-8a3b-69964f3434f6.png) ![after2](https://cloud.githubusercontent.com/assets/4736016/14273249/1b6b9c38-fb44-11e5-9d6f-281d64c842e4.png) The appearance is not changed but the html source code is changed. ## How was this patch tested? Manually run some jobs on my standalone-cluster and check the WebUI. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12170 from sarutak/SPARK-14397.
*	[SPARK-12425][STREAMING] DStream union optimisation	Guillaume Poulin	2016-04-05	1	-5/+1
\| \| \| \| \| \| \| \| \|	Use PartitionerAwareUnionRDD when possbile for optimizing shuffling and preserving the partitioner. Author: Guillaume Poulin <poulin.guillaume@gmail.com> Closes #10382 from gpoulin/dstream_union_optimisation.
*	[SPARK-13579][BUILD] Stop building the main Spark assembly.	Marcelo Vanzin	2016-04-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.
*	[SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame	Davies Liu	2016-04-04	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer). This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds. The JDBC server has been updated to use DataFrame.toIterator. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12114 from davies/local_iterator.
*	[SPARK-14358] Change SparkListener from a trait to an abstract class	Reynold Xin	2016-04-04	3	-188/+264
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12142 from rxin/SPARK-14358.
*	[SPARK-14364][SPARK] HeartbeatReceiver object should be private	Reynold Xin	2016-04-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? It's a mistake that HeartbeatReceiver object was made public in Spark 1.x. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #12148 from rxin/SPARK-14364.
*	[SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static ↵	Dongjoon Hyun	2016-04-03	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	analysis results ## What changes were proposed in this pull request? This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines). - Fix typos(exception/log strings, testcase name, comments) in 44 lines. - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011) - Use diamond operators in 40 lines. (New codes after SPARK-13702) - Fix redundant semicolon in 5 lines. - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala. ## How was this patch tested? Manual and pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12139 from dongjoon-hyun/SPARK-14355.
*	[SPARK-14163][CORE] SumEvaluator and countApprox cannot reliably handle RDDs ↵	Marcin Tustin	2016-04-03	2	-13/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	of size 1 ## What changes were proposed in this pull request? This special cases 0 and 1 counts to avoid passing 0 degrees of freedom. ## How was this patch tested? Tests run successfully. New test added. ## Note: This recreates #11982 which was closed to due to non-updated diff. rxin srowen Commented there. This also adds tests, reworks the code to perform the special casing (based on srowen's comments), and adds equality machinery for BoundedDouble, as well as changing how it is transformed to string. Author: Marcin Tustin <mtustin@handybook.com> Author: Marcin Tustin <mtustin@handy.com> Closes #12016 from mtustin-handy/SPARK-14163.
*	[SPARK-14056] Appends s3 specific configurations and spark.hadoop con…	Sital Kedia	2016-04-02	1	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Appends s3 specific configurations and spark.hadoop configurations to hive configuration. ## How was this patch tested? Tested by running a job on cluster. …figurations to hive configuration. Author: Sital Kedia <skedia@fb.com> Closes #11876 from sitalkedia/hiveConf.
*	[SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon	Liwei Lin	2016-04-02	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Straggler references to Tachyon were removed: - for docs, `tachyon` has been generalized as `off-heap memory`; - for Mesos test suits, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constrain used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/). ## How was this patch tested? Existing test suites. Author: Liwei Lin <lwlin7@gmail.com> Closes #12129 from lw-lin/tachyon-cleanup.
*	[MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.	Dongjoon Hyun	2016-04-02	14	-133/+147
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes. (All comment-only changes over 77 files: +786 lines, −747 lines) ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
*	[SPARK-13241][WEB UI] Added long values for dates in ApplicationAttemptInfo API	Alex Bozarth	2016-04-01	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Adding long values for each Date in the ApplicationAttemptInfo API for easier use in code ## How was the this patch tested? Tested with dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #11326 from ajbozarth/spark13241.
*	[SPARK-13992] Add support for off-heap caching	Josh Rosen	2016-04-01	10	-126/+223
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds support for caching blocks in the executor processes using direct / off-heap memory. ## User-facing changes Updated semantics of `OFF_HEAP` storage level: In Spark 1.x, the `OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon. Spark 2.x removed the external block store API that Tachyon caching was based on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for `MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap` can be set if `serialized == true` and can be used to construct custom storage levels which support replication. Storage UI reporting: the storage UI will now report whether in-memory blocks are stored on- or off-heap. Only supported by UnifiedMemoryManager: for simplicity, this feature is only supported when the default UnifiedMemoryManager is used; applications which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not currently able to allocate off-heap storage memory, so using off-heap caching will fail with an error when legacy memory management is enabled. Given that we plan to eventually remove the legacy memory manager, this is not a significant restriction. Memory management policies: the policies for dividing available memory between execution and storage are the same for both on- and off-heap memory. For off-heap memory, the total amount of memory available for use by Spark is controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap storage memory obeys `spark.memory.storageFraction` in order to control the amount of unevictable storage memory. For example, if `spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default `storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks will be protected from eviction due to execution memory pressure. If necessary, we can split `spark.memory.storageFraction` into separate on- and off-heap configurations, but this doesn't seem necessary now and can be done later without any breaking changes. Use of off-heap memory does not imply use of off-heap execution (or vice-versa): for now, the settings controlling the use of off-heap execution memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely independent, so Spark SQL can be configured to use off-heap memory for execution while continuing to cache blocks on-heap. If desired, we can change this in a followup patch so that `spark.memory.offHeap.enabled` affect the default storage level for cached SQL tables. ## Internal changes - Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream` - It now returns a `ChunkedByteBuffer` instead of an array of byte arrays. - Its constructor now accept an `allocator` function which is called to allocate `ByteBuffer`s. This allows us to control whether it allocates regular ByteBuffers or off-heap DirectByteBuffers. - Because block serialization is now performed during the unroll process, a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer` allocator will use off-heap memory for both unroll and storage memory. - The `MemoryStore`'s MemoryEntries now tracks whether blocks are stored on- or off-heap. - `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that we don't try to evict off-heap blocks in response to on-heap memory pressure (or vice-versa). - Make sure that off-heap buffers are properly de-allocated during MemoryStore eviction. - The JVM limits the total size of allocated direct byte buffers using the `-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512 megabytes in some JVMs). To work around this limitation, this patch adds a custom DirectByteBuffer allocator which ignores this memory limit. Author: Josh Rosen <joshrosen@databricks.com> Closes #11805 from JoshRosen/off-heap-caching.
*	[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster ↵	zhonghaihua	2016-04-01	2	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	killed for max n… Currently, when max number of executor failures reached the `maxNumExecutorFailures`, `ApplicationMaster` will be killed and re-register another one.This time, `YarnAllocator` will be created a new instance. But, the value of property `executorIdCounter` in `YarnAllocator` will reset to `0`. Then the Id of new executor will starting from `1`. This will confuse with the executor has already created before, which will cause FetchFailedException. This situation is just in yarn client mode, so this is an issue in yarn client mode. For more details, [link to jira issues SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864) This PR introduce a mechanism to initialize `executorIdCounter` after `ApplicationMaster` killed. Author: zhonghaihua <793507405@qq.com> Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
*	[SPARK-13674] [SQL] Add wholestage codegen support to Sample	Liang-Chi Hsieh	2016-04-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-13674 ## What changes were proposed in this pull request? Sample operator doesn't support wholestage codegen now. This pr is to add support to it. ## How was this patch tested? A test is added into `BenchmarkWholeStageCodegen`. Besides, all tests should be passed. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11517 from viirya/add-wholestage-sample.
*	[SPARK-12343][YARN] Simplify Yarn client and client argument	jerryshao	2016-04-01	2	-27/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments, some parts are duplicated, like client argument and SparkConf. So here propose to simplify the command arguments. ## How was this patch tested? This patch is tested manually with unit test. CC vanzin tgravescs , please help to suggest this proposal. The original purpose of this JIRA is to remove `ClientArguments`, through refactoring some arguments like `--class`, `--arg` are not so easy to replace, so here I remove the most part of command line arguments, only keep the minimal set. Author: jerryshao <sshao@hortonworks.com> Closes #11603 from jerryshao/SPARK-12343.
*	[SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch	Davies Liu	2016-03-31	1	-15/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR support multiple Python UDFs within single batch, also improve the performance. ```python >>> from pyspark.sql.types import IntegerType >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType()) >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType()) >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True) == Parsed Logical Plan == 'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)] +- OneRowRelation$ == Analyzed Logical Plan == double(add(1, 2)): int, add(double(2), 1): int Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Optimized Logical Plan == Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Physical Plan == WholeStageCodegen : +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] : +- INPUT +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18] +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- Scan OneRowRelation[] ``` ## How was this patch tested? Added new tests. Using the following script to benchmark 1, 2 and 3 udfs, ``` df = sqlContext.range(1, 1 << 23, 1, 4) double = F.udf(lambda x: x * 2, LongType()) print df.select(double(df.id)).count() print df.select(double(df.id), double(df.id + 1)).count() print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count() ``` Here is the results: N \| Before \| After \| speed up ---- \|------------ \| -------------\|------ 1 \| 22 s \| 7 s \| 3.1X 2 \| 38 s \| 13 s \| 2.9X 3 \| 58 s \| 16 s \| 3.6X This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering). Author: Davies Liu <davies@databricks.com> Closes #12057 from davies/multi_udfs.
*	[SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit ↵	Jo Voordeckers	2016-03-31	1	-0/+26
\| \| \| \| \| \| \| \| \| \| \|	request Supersedes https://github.com/apache/spark/pull/9752 Author: Jo Voordeckers <jo.voordeckers@gmail.com> Author: Iulian Dragos <jaguarul@gmail.com> Closes #10370 from jayv/mesos_cluster_params.
*	[SPARK-14069][SQL] Improve SparkStatusTracker to also track executor information	Wenchen Fan	2016-03-31	5	-16/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Track executor information like host and port, cache size, running tasks. TODO: tests ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #11888 from cloud-fan/status-tracker.
*	[SPARK-14243][CORE] update task metrics when removing blocks	jeanlyn	2016-03-31	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR try to use `incUpdatedBlockStatuses ` to update the `updatedBlockStatuses ` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses` ## How was this patch tested? test("updated block statuses") in BlockManagerSuite.scala Author: jeanlyn <jeanlyn92@gmail.com> Closes #12091 from jeanlyn/updateBlock.
*	[SPARK-13796] Redirect error message to logWarning	Nishkam Ravi	2016-03-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Redirect error message to logWarning ## How was this patch tested? Unit tests, manual tests JoshRosen Author: Nishkam Ravi <nishkamravi@gmail.com> Closes #12052 from nishkamravi2/master_warning.
*	[SPARK-12181] Check Cached unaligned-access capability before using Unsafe	tedyu	2016-03-29	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? For MemoryMode.OFF_HEAP, Unsafe.getInt etc. are used with no restriction. However, the Oracle implementation uses these methods only if the class variable unaligned (commented as "Cached unaligned-access capability") is true, which seems to be calculated whether the architecture is i386, x86, amd64, or x86_64. I think we should perform similar check for the use of Unsafe. Reference: https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L112 ## How was this patch tested? Unit test suite Author: tedyu <yuzhihong@gmail.com> Closes #11943 from tedyu/master.
*	[SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs	Davies Liu	2016-03-29	1	-16/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR brings the support for chained Python UDFs, for example ```sql select udf1(udf2(a)) select udf1(udf2(a) + 3) select udf1(udf2(a) + udf3(b)) ``` Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches. For example, ```python >>> sqlContext.sql("select double(double(1))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#10 AS double(double(1))#9] : +- INPUT +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10] +- Scan OneRowRelation[] >>> sqlContext.sql("select double(double(1) + double(2))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16] : +- INPUT +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19] +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18] +- !BatchPythonEvaluation double(1), [pythonUDF#17] +- Scan OneRowRelation[] ``` TODO: will support multiple unrelated Python UDFs in one batch (another PR). ## How was this patch tested? Added new unit tests for chained UDFs. Author: Davies Liu <davies@databricks.com> Closes #12014 from davies/py_udfs.
*	[SPARK-10570][CORE] Add version info to json api	Jakob Odersky	2016-03-29	3	-0/+39
\| \| \| \| \| \| \| \|	Add a new api endpoint `/api/v1/version` to retrieve various version info. This PR only adds support for finding the current spark version, however other version info such as jvm or scala versions can easily be added. Author: Jakob Odersky <jodersky@gmail.com> Closes #10760 from jodersky/SPARK-10570.
*	[SPARK-14232][WEBUI] Fix event timeline display issue when an executor is ↵	Carson Wang	2016-03-29	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	removed with a multiple line reason. ## What changes were proposed in this pull request? The event timeline doesn't show on job page if an executor is removed with a multiple line reason. This PR replaces all new line characters in the reason string with spaces. ![timelineerror](https://cloud.githubusercontent.com/assets/9278199/14100211/5fd4cd30-f5be-11e5-9cea-f32651a4cd62.jpg) ## How was this patch tested? Verified on the Web UI. Author: Carson Wang <carson.wang@intel.com> Closes #12029 from carsonwang/eventTimeline.
*	[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.	Sun Rui	2016-03-28	2	-324/+372
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs. Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later. ## How was this patch tested? dev/lint-r SparkR unit tests Author: Sun Rui <rui.sun@intel.com> Closes #12024 from sun-rui/SPARK-12792_new.
*	[SPARK-13447][YARN][CORE] Clean the stale states for AM failure and restart ↵	jerryshao	2016-03-28	1	-12/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	situation ## What changes were proposed in this pull request? This is a follow-up fix of #9963, in #9963 we handle this stale states clean-up work only for dynamic allocation enabled scenario. Here we should also clean the states in `CoarseGrainedSchedulerBackend` for dynamic allocation disabled scenario. Please review, CC andrewor14 lianhuiwang , thanks a lot. ## How was this patch tested? Run the unit test locally, also with integration test manually. Author: jerryshao <sshao@hortonworks.com> Closes #11366 from jerryshao/SPARK-13447.
*	[SPARK-13845][CORE] Using onBlockUpdated to replace onTaskEnd avioding ↵	jeanlyn	2016-03-28	2	-22/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	driver OOM ## What changes were proposed in this pull request? We have a streaming job using `FlumePollInputStream` always driver OOM after few days, here is some driver heap dump before OOM ``` num #instances #bytes class name ---------------------------------------------- 1: 13845916 553836640 org.apache.spark.storage.BlockStatus 2: 14020324 336487776 org.apache.spark.storage.StreamBlockId 3: 13883881 333213144 scala.collection.mutable.DefaultEntry 4: 8907 89043952 [Lscala.collection.mutable.HashEntry; 5: 62360 65107352 [B 6: 163368 24453904 [Ljava.lang.Object; 7: 293651 20342664 [C ... ``` `BlockStatus` and `StreamBlockId` keep on growing, and the driver OOM in the end. After investigated, i found the `executorIdToStorageStatus` in `StorageStatusListener` seems never remove the blocks from `StorageStatus`. In order to fix the issue, i try to use `onBlockUpdated` replace `onTaskEnd ` , so we can update the block informations(add blocks, drop the block from memory to disk and delete the blocks) in time. ## How was this patch tested? Existing unit tests and manual tests Author: jeanlyn <jeanlyn92@gmail.com> Closes #11779 from jeanlyn/fix_driver_oom.
*	[SPARK-14169][CORE] Add UninterruptibleThread	Shixiong Zhu	2016-03-28	1	-0/+112
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it. ## How was this patch tested? Unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11971 from zsxwing/uninterrupt.
*	[SPARK-14180][CORE] Fix a deadlock in CoarseGrainedExecutorBackend Shutdown	Shixiong Zhu	2016-03-28	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Call `executor.stop` in a new thread to eliminate deadlock. ## How was this patch tested? Existing unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #12012 from zsxwing/SPARK-14180.
*	Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF."	Davies Liu	2016-03-28	2	-371/+324
\| \| \| \|	This reverts commit 40984f67065eeaea731940008e6677c2323dda3e.
*	[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.	Sun Rui	2016-03-28	2	-324/+371
\| \| \| \| \| \| \| \| \| \|	Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs. Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later. Author: Sun Rui <rui.sun@intel.com> Closes #10947 from sun-rui/SPARK-12792.
*	[SPARK-13742] [CORE] Add non-iterator interface to RandomSampler	Liang-Chi Hsieh	2016-03-28	1	-111/+90
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-13742 ## What changes were proposed in this pull request? `RandomSampler.sample` currently accepts iterator as input and output another iterator. This makes it inappropriate to use in wholestage codegen of `Sampler` operator #11517. This change is to add non-iterator interface to `RandomSampler`. This change adds a new method `def sample(): Int` to the trait `RandomSampler`. As we don't need to know the actual values of the sampling items, so this new method takes no arguments. This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled. For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not. For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times. ## How was this patch tested? Tests are added into `RandomSamplerSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11578 from viirya/random-sampler-no-iterator.
*	[SPARK-14135] Add off-heap storage memory bookkeeping support to MemoryManager	Josh Rosen	2016-03-26	8	-114/+192
\| \| \| \| \| \| \| \| \| \|	This patch extends Spark's `UnifiedMemoryManager` to add bookkeeping support for off-heap storage memory, an requirement for enabling off-heap caching (which will be done by #11805). The `MemoryManager`'s `storageMemoryPool` has been split into separate on- and off-heap pools and the storage and unroll memory allocation methods have been updated to accept a `memoryMode` parameter to specify whether allocations should be performed on- or off-heap. In order to reduce the testing surface, the `StaticMemoryManager` does not support off-heap caching (we plan to eventually remove the `StaticMemoryManager`, so this isn't a significant limitation). Author: Josh Rosen <joshrosen@databricks.com> Closes #11942 from JoshRosen/off-heap-storage-memory-bookkeeping.
*	[SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since ↵	Liwei Lin	2016-03-26	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1.1, 1.2, 1.3, 1.4, and 1.5 ## What changes were proposed in this pull request? Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5. ## How was this patch tested? - manully checked that no codes in Spark call these methods any more - existing test suits Author: Liwei Lin <lwlin7@gmail.com> Author: proflin <proflin.me@gmail.com> Closes #11910 from lw-lin/remove-deprecates.
*	[MINOR] Fix newly added java-lint errors	Dongjoon Hyun	2016-03-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR fixes some newly added java-lint errors(unused-imports, line-lengsth). ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11968 from dongjoon-hyun/SPARK-14167.
*	[SPARK-14091][CORE] Improve performance of SparkContext.getCallSite()	Rajesh Balamohan	2016-03-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently SparkContext.getCallSite() makes a call to Utils.getCallSite(). ``` private[spark] def getCallSite(): CallSite = { val callSite = Utils.getCallSite() CallSite( Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm), Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm) ) } ``` However, in some places utils.withDummyCallSite(sc) is invoked to avoid expensive threaddumps within getCallSite(). But Utils.getCallSite() is evaluated earlier causing threaddumps to be computed. This can have severe impact on smaller queries (that finish in 10-20 seconds) having large number of RDDs. Creating this patch for lazy evaluation of getCallSite. No new test cases are added. Following standalone test was tried out manually. Also, built entire spark binary and tried with few SQL queries in TPC-DS and TPC-H in multi node cluster ``` def run(): Unit = { val conf = new SparkConf() val sc = new SparkContext("local[1]", "test-context", conf) val start: Long = System.currentTimeMillis(); val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration())) Utils.withDummyCallSite(sc) { //Large tables end up creating 5500 RDDs for(i <- 1 to 5000) { //ignore nulls in RDD as its mainly for testing callSite val testRDD = new HadoopRDD(sc, confBroadcast, None, null, classOf[NullWritable], classOf[Writable], 10) } } val end: Long = System.currentTimeMillis(); println("Time taken : " + (end - start)) } def main(args: Array[String]): Unit = { run } ``` Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #11911 from rajeshbalamohan/SPARK-14091.
*	[SPARK-14149] Log exceptions in tryOrIOException	Reynold Xin	2016-03-25	1	-17/+6
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place. ## How was this patch tested? A logging change with a manual test. Author: Reynold Xin <rxin@databricks.com> Closes #11951 from rxin/SPARK-14149.
*	[SPARK-13980] Incrementally serialize blocks while unrolling them in MemoryStore	Josh Rosen	2016-03-24	4	-43/+248
\| \| \| \| \| \| \| \| \| \| \| \|	When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code will unroll the entire iterator as Java objects in memory, then will turn around and serialize an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements. Instead, I think that we should incrementally serialize blocks while unrolling them. A downside to incremental serialization is the fact that we will need to deserialize the partially-unrolled data in case there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory efficiency improvements will outweigh any performance losses as a result of extra serialization in that hopefully-rare case. Author: Josh Rosen <joshrosen@databricks.com> Closes #11791 from JoshRosen/serialize-incrementally.
*	[SPARK-14110][CORE] PipedRDD to print the command ran on non zero exit	Tejas Patil	2016-03-24	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In case of failure in subprocess launched in PipedRDD, the failure exception reads “Subprocess exited with status XXX”. Debugging this is not easy for users especially if there are multiple pipe() operations in the Spark application. Changes done: - Changed the exception message when non-zero exit code is seen - If the reader and writer threads see exception, simply logging the command ran. The current model is to propagate the exception "as is" so that upstream Spark logic will take the right action based on what the exception was (eg. for fetch failure, it needs to retry; but for some fatal exception, it will decide to fail the stage / job). So wrapping the exception with a generic exception will not work. Altering the exception message will keep that guarantee but that is ugly (plus not all exceptions might have a constructor for a string message) ## How was this patch tested? - Added a new test case - Ran all existing tests for PipedRDD Author: Tejas Patil <tejasp@fb.com> Closes #11927 from tejasapatil/SPARK-14110-piperdd-failure.