path: root/core
Each entry below gives the commit message, author, date, and files/lines changed (-deletions/+insertions).
* [SPARK-7562][SPARK-6444][SQL] Improve error reporting for expression data type mismatch (Wenchen Fan, 2015-06-03, 1 file, -2/+2)
It seems hard to find a common pattern for checking types in `Expression`. Sometimes we know exactly what input types we need (`And`, for example, needs two booleans); sometimes we only have a rule (`Add` needs two numeric types that are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` can tell whether the expression passes the type check and, if not, what the mismatch is. This PR mainly applies input type checking to arithmetic and predicate expressions. TODO: apply the type checking interface to more expressions. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6405 from cloud-fan/6444 and squashes the following commits: b5ff31b [Wenchen Fan] address comments b917275 [Wenchen Fan] rebase 39929d9 [Wenchen Fan] add todo 0808fd2 [Wenchen Fan] make constructor of TypeCheckResult private 3bee157 [Wenchen Fan] add decimal type coercion rule for binary comparison 8883025 [Wenchen Fan] apply type check interface to CaseWhen cffb67c [Wenchen Fan] have resolved call the data type check function 6eaadff [Wenchen Fan] add equal type constraint to EqualTo 3affbd8 [Wenchen Fan] more fixes 654d46a [Wenchen Fan] improve tests e0a3628 [Wenchen Fan] improve error message 1524ff6 [Wenchen Fan] fix style 69ca3fe [Wenchen Fan] add error message and tests c71d02c [Wenchen Fan] fix hive tests 6491721 [Wenchen Fan] use value class TypeCheckResult 7ae76b9 [Wenchen Fan] address comments cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
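The interface shape this commit describes can be sketched as follows; this is an illustrative simplification in plain Scala, not the actual Catalyst code (the real `Expression` and `TypeCheckResult` live in `org.apache.spark.sql.catalyst`):

```scala
// Sketch: each expression reports whether its children's types are acceptable.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

sealed trait DataType
case object BooleanType extends DataType

trait Expression {
  def dataType: DataType
  def checkInputDataTypes(): TypeCheckResult
}

// `And` knows exactly what it needs: two boolean inputs.
case class And(left: Expression, right: Expression) extends Expression {
  def dataType: DataType = BooleanType
  def checkInputDataTypes(): TypeCheckResult =
    if (left.dataType == BooleanType && right.dataType == BooleanType) TypeCheckSuccess
    else TypeCheckFailure(s"And requires booleans, got ${left.dataType} and ${right.dataType}")
}
```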
* [SPARK-8027] [SPARKR] Add maven profile to build R package docs (Shivaram Venkataraman, 2015-06-01, 1 file, -0/+23)
Also use that profile in create-release.sh. cc pwendell -- Note that this means that we need `knitr` and `roxygen` installed on the machines used for building the release. Let me know if you need help with that. Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6567 from shivaram/SPARK-8027 and squashes the following commits: 8dc8ecf [Shivaram Venkataraman] Add maven profile to build R package docs. Also use that profile in create-release.sh
* [SPARK-8028] [SPARKR] Use addJar instead of setJars in SparkR (Shivaram Venkataraman, 2015-06-01, 1 file, -2/+5)
This prevents spark.jars from being cleared while using `--packages` or `--jars`. cc pwendell davies brkyvz Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6568 from shivaram/SPARK-8028 and squashes the following commits: 3a9cf1f [Shivaram Venkataraman] Use addJar instead of setJars in SparkR. This prevents spark.jars from being cleared
* [MINOR] [UI] Improve error message on log page (Andrew Or, 2015-06-01, 2 files, -0/+76)
Currently, if a bad log type is specified, we get a blank page. We should provide a more informative error message.
* [SPARK-7227] [SPARKR] Support fillna / dropna in R DataFrame (Sun Rui, 2015-05-31, 1 file, -2/+4)
Author: Sun Rui <rui.sun@intel.com> Closes #6183 from sun-rui/SPARK-7227 and squashes the following commits: dd6f5b3 [Sun Rui] Rename readEnv() back to readMap(). Add alias na.omit() for dropna(). 41cf725 [Sun Rui] [SPARK-7227][SPARKR] Support fillna / dropna in R DataFrame.
* [SPARK-7979] Enforce structural type checker (Reynold Xin, 2015-05-31, 1 file, -1/+1)
Author: Reynold Xin <rxin@databricks.com> Closes #6536 from rxin/structural-type-checker and squashes the following commits: f833151 [Reynold Xin] Fixed compilation. 633f9a1 [Reynold Xin] Fixed typo. d1fa804 [Reynold Xin] [SPARK-7979] Enforce structural type checker.
* [SPARK-3850] Trim trailing spaces for core (Reynold Xin, 2015-05-31, 47 files, -117/+117)
Author: Reynold Xin <rxin@databricks.com> Closes #6533 from rxin/whitespace-2 and squashes the following commits: 038314c [Reynold Xin] [SPARK-3850] Trim trailing spaces for core.
* [SPARK-7976] Add style checker to disallow overriding finalize (Reynold Xin, 2015-05-30, 1 file, -0/+2)
Author: Reynold Xin <rxin@databricks.com> Closes #6528 from rxin/style-finalizer and squashes the following commits: a2211ca [Reynold Xin] [SPARK-7976] Enable NoFinalizeChecker.
* [HOTFIX] Replace FunSuite with SparkFunSuite (Josh Rosen, 2015-05-30, 3 files, -9/+4)
This fixes a build break introduced by merging a6430028ecd7a6130f1eb15af9ec00e242c46725, which fails the new style checks that ensure that we use SparkFunSuite instead of FunSuite.
* [SPARK-7855] Move bypassMergeSort-handling from ExternalSorter to own component (Josh Rosen, 2015-05-30, 17 files, -423/+738)
Spark's `ExternalSorter` writes shuffle output files during sort-based shuffle. Sort-shuffle contains a configuration, `spark.shuffle.sort.bypassMergeThreshold`, which causes ExternalSorter to skip sorting and merging and simply write separate files per partition, which are then concatenated together to form the final map output file. The code paths used during this bypass are almost completely separate from ExternalSorter's other code paths, so refactoring them into a separate file can significantly simplify the code.

In addition to re-arranging code, this patch deletes a bunch of dead code. The main entry point into ExternalSorter is `insertAll()` and in SPARK-4479 / #3422 this method was modified to completely bypass in-memory buffering of records when `bypassMergeSort` takes effect. As a result, some of the spilling and merging code paths will no longer be called when `bypassMergeSort` is used, so we should be able to safely remove that code. There's an open JIRA ([SPARK-6026](https://issues.apache.org/jira/browse/SPARK-6026)) for removing the `bypassMergeThreshold` parameter and code paths; I have not done that here, but the changes in this patch will make removing that parameter significantly easier if we ever decide to do that.

This patch also makes several improvements to shuffle-related tests and adds more defensive checks to certain shuffle classes (the bypass decision itself is sketched below):
- DiskBlockObjectWriter now throws an exception if `fileSegment()` is called before `commitAndClose()` has been called.
- DiskBlockObjectWriter's close methods are now idempotent, so calling any of the close methods twice in a row will no longer result in incorrect shuffle write metrics changes. Calling `revertPartialWritesAndClose()` on a closed DiskBlockObjectWriter now has no effect (before, it might mess up the metrics).
- The end-to-end shuffle record count metrics tests have been moved from InputOutputMetricsSuite to ShuffleSuite. This means that these tests will now be run against all shuffle implementations rather than just the default shuffle configuration.
- The end-to-end metrics tests now include a test of a job which performs aggregation in the shuffle.
- Our tests now check that `shuffleBytesWritten == totalShuffleBytesRead`.
- FileSegment now throws IllegalArgumentException if it is constructed with a negative length or offset.

Author: Josh Rosen <joshrosen@databricks.com> Closes #6397 from JoshRosen/external-sorter-bypass-cleanup and squashes the following commits: bf3f3f6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into external-sorter-bypass-cleanup 8b216c4 [Josh Rosen] Guard against negative offsets and lengths in FileSegment 03f35a4 [Josh Rosen] Minor fix to cleanup logic. b5cc35b [Josh Rosen] Move shuffle metrics tests to ShuffleSuite. 8b8fb9e [Josh Rosen] Add more tests + defensive programming to DiskBlockObjectWriter. 16564eb [Josh Rosen] Guard against calling fileSegment() before commitAndClose() has been called. 96811b4 [Josh Rosen] Remove confusing taskMetrics.shuffleWriteMetrics() optional call 8522b6a [Josh Rosen] Do not perform a map-side sort unless we're also doing map-side aggregation 08e40f3 [Josh Rosen] Remove excessively clever (and wrong) implementation of newBuffer() d7f9938 [Josh Rosen] Add missing overrides; fix compilation 71d76ff [Josh Rosen] Update Javadoc bf0d98f [Josh Rosen] Add comment to clarify confusing factory code 5197f73 [Josh Rosen] Add missing private[this] 30ef2c8 [Josh Rosen] Convert BypassMergeSortShuffleWriter to Java bc1a820 [Josh Rosen] Fix bug when aggregator is used but map-side combine is disabled 0d3dcc0 [Josh Rosen] Remove unnecessary overloaded methods 25b964f [Josh Rosen] Rename SortShuffleSorter to SortShuffleFileWriter 0d9848c [Josh Rosen] Make it more clear that curWriteMetrics is now only used for spill metrics 7af7aea [Josh Rosen] Combine spill() and spillToMergeableFile() 6320112 [Josh Rosen] Add missing negation in deletion success check. d267e0d [Josh Rosen] Fix style issue 7f15f7b [Josh Rosen] Back out extra cleanup-handling code, since this is already covered in stop() 25aa3bd [Josh Rosen] Make sure to delete outputFile after errors. 931ca68 [Josh Rosen] Refactor tests. 6a35716 [Josh Rosen] Refactor logic for deciding when to bypass 4b03539 [Josh Rosen] Move conf prior to first use 1265b25 [Josh Rosen] Fix some style errors and comments. 02355ef [Josh Rosen] More simplification d4cb536 [Josh Rosen] Delete more unused code bb96678 [Josh Rosen] Add missing interface file b6cc1eb [Josh Rosen] Realize that bypass never buffers; proceed to delete tons of code 6185ee2 [Josh Rosen] WIP towards moving bypass code into own file. 8d0678c [Josh Rosen] Move diskBytesSpilled getter next to variable 19bccd6 [Josh Rosen] Remove duplicated buffer creation code. 18959bb [Josh Rosen] Move comparator methods closer together.
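The rule the new component encapsulates can be sketched like this (hypothetical names, not the code from the patch): bypass only applies with few partitions and no map-side aggregation.

```scala
// Sketch: skip sorting/merging and write one file per partition only when
// the partition count is at or below spark.shuffle.sort.bypassMergeThreshold
// and no map-side combine is needed.
case class ShuffleDep(numPartitions: Int, mapSideCombine: Boolean)

def shouldBypassMergeSort(dep: ShuffleDep, bypassMergeThreshold: Int): Boolean =
  !dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold

val bypass = shouldBypassMergeSort(ShuffleDep(100, mapSideCombine = false), 200) // true
```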
* [SPARK-7717] [WEBUI] Only showing total memory and cores for alive workers (zhichao.li, 2015-05-30, 1 file, -6/+7)
Author: zhichao.li <zhichao.li@intel.com> Closes #6317 from zhichao-li/workers and squashes the following commits: d68bf11 [zhichao.li] change prefix 99b6768 [zhichao.li] remove extra space and add 'Alive' prefix 1e8eb06 [zhichao.li] only showing alive workers
* [SPARK-7962] [MESOS] Fix master url parsing in rest submission client (Timothy Chen, 2015-05-29, 1 file, -1/+5)
Only parse the standalone master url when the master url starts with spark://. Author: Timothy Chen <tnachen@gmail.com> Closes #6517 from tnachen/fix_mesos_client and squashes the following commits: 61a1198 [Timothy Chen] Fix master url parsing in rest submission client.
* [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike (Andrew Or, 2015-05-29, 1 file, -0/+2)
This is a follow-up patch to #6441. Author: Andrew Or <andrew@databricks.com> Closes #6510 from andrewor14/extends-funsuite-check and squashes the following commits: 6618b46 [Andrew Or] Exempt SparkSinkSuite from the FunSuite check 99d02ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into extends-funsuite-check 48874dd [Andrew Or] Guard against direct uses of FunSuite / FunSuiteLike
* [SPARK-7957] Preserve partitioning when using randomSplit (Burak Yavuz, 2015-05-29, 1 file, -2/+2)
cc JoshRosen Thanks for noticing this! Author: Burak Yavuz <brkyvz@gmail.com> Closes #6509 from brkyvz/sample-perf-reg and squashes the following commits: 497465d [Burak Yavuz] addressed code review 293f95f [Burak Yavuz] [SPARK-7957] Preserve partitioning when using randomSplit
* [SPARK-7910] [TINY] [JAVAAPI] expose partitioner information in javardd (Holden Karau, 2015-05-29, 2 files, -3/+8)
Author: Holden Karau <holden@pigscanfly.ca> Closes #6464 from holdenk/SPARK-7910-expose-partitioner-information-in-javardd and squashes the following commits: de1e644 [Holden Karau] Fix the test to get the partitioner bdb31cc [Holden Karau] Add Mima exclude for the new method 347ef4c [Holden Karau] Add a quick little test for the partitioner JavaAPI f49dca9 [Holden Karau] Add partitioner information to JavaRDDLike and fix some whitespace
* [SPARK-7558] Demarcate tests in unit-tests.log (Andrew Or, 2015-05-29, 147 files, -381/+380)
Right now `unit-tests.log` is not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:
```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====
15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents
...
15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite: ***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****
...
```
Author: Andrew Or <andrew@databricks.com> Closes #6441 from andrewor14/demarcate-tests and squashes the following commits: 879b060 [Andrew Or] Fix compile after rebase d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests 017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests 7790b6c [Andrew Or] Fix tests after logical merge conflict c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests c43ffc4 [Andrew Or] Fix tests? 8882581 [Andrew Or] Fix tests ee22cda [Andrew Or] Fix log message fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests 12d1e1b [Andrew Or] Various whitespace changes (minor) 69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite bbce12e [Andrew Or] Fix manual things that cannot be covered through automation da0b12f [Andrew Or] Add core tests as dependencies in all modules f7d29ce [Andrew Or] Introduce base abstract class for all test suites
* [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker (Reynold Xin, 2015-05-29, 4 files, -8/+6)
Author: Reynold Xin <rxin@databricks.com> Closes #6491 from rxin/more-whitespace and squashes the following commits: f6e63dc [Reynold Xin] [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
* [SPARK-7524] [SPARK-7846] add configs for keytab and principal, pass these two configs in different ways in different modes (WangTaoTheTonic, 2015-05-29, 2 files, -4/+6)
- Spark now supports long-running services by updating tokens for the namenode, but it only accepts parameters passed in "--k=v" format, which is not very convenient. This patch adds spark.* configs in the properties file and system properties.
- The --principal and --keytab options are passed to the client, but when we start the thrift server or spark-shell these two are also passed into the main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main). In these two main classes, the arguments passed in are processed by third-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal". We should pass these command args in a different form, say system properties.
Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #6051 from WangTaoTheTonic/SPARK-7524 and squashes the following commits: e65699a [WangTaoTheTonic] change logic to loadEnvironments ebd9ea0 [WangTaoTheTonic] merge master ecfe43a [WangTaoTheTonic] pass keytab and principal separately in different mode 33a7f40 [WangTaoTheTonic] expand the use of the current configs 08bb4e8 [WangTaoTheTonic] fix wrong cite 73afa64 [WangTaoTheTonic] add configs for keytab and principal, move originals to internal
* [SPARK-7863] [CORE] Create SimpleDateFormat for every SimpleDateParam instance because it's not thread-safe (zsxwing, 2015-05-29, 2 files, -28/+26)
SimpleDateFormat is not thread-safe. This PR creates a new `SimpleDateFormat` for each `SimpleDateParam` instance. Author: zsxwing <zsxwing@gmail.com> Closes #6406 from zsxwing/SPARK-7863 and squashes the following commits: aeed4c1 [zsxwing] Rewrite SimpleDateParam 8cdd986 [zsxwing] Inline formats 9680a15 [zsxwing] Create SimpleDateFormat for each SimpleDateParam instance because it's not thread-safe
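Why the per-instance format matters, in a minimal sketch (illustrative class, not the patched one): `SimpleDateFormat` carries internal mutable state, so one instance shared across request-handling threads can silently corrupt parses; giving each object its own instance avoids that.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Sketch: each param object owns its own SimpleDateFormat,
// so no two threads ever touch the same (stateful) format.
class DateParam(value: String) {
  private val format = {
    val f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
    f.setTimeZone(TimeZone.getTimeZone("GMT"))
    f
  }
  val timestamp: Long = format.parse(value).getTime
}
```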
* [SPARK-7756] [CORE] Use testing cipher suites common to Oracle and IBM security providers (Tim Ellison, 2015-05-29, 2 files, -4/+4)
Add alias names for supported cipher suites to the sample SSL configuration. The IBM JSSE provider reports its cipher suites with an SSL_ prefix, but accepts TLS_ prefixed suite names as an alias. However, Jetty filters the requested ciphers based on the provider's reported supported suites, so the TLS_ versions are never passed through to JSSE, causing an SSL handshake failure. Author: Tim Ellison <t.p.ellison@gmail.com> Closes #6282 from tellison/SSLFailure and squashes the following commits: 8de8a3e [Tim Ellison] Update SecurityManagerSuite with new expected suite names 96158b2 [Tim Ellison] Update the sample configs to use ciphers that are common to both the Oracle and IBM security providers. 705421b [Tim Ellison] Merge branch 'master' of github.com:tellison/spark into SSLFailure 68b9425 [Tim Ellison] Merge branch 'master' of https://github.com/apache/spark into SSLFailure b0c35f6 [Tim Ellison] [CORE] Add aliases used for cipher suites in IBM provider
* [SPARK-7930] [CORE] [STREAMING] Fixed shutdown hook priorities (Tathagata Das, 2015-05-28, 2 files, -4/+12)
The shutdown hook for temp directories had priority 100 while SparkContext's was 50, so the local root directory was deleted before SparkContext was shut down. This leads to scary errors in running jobs at the time of shutdown. This is especially a problem when running streaming examples, where Ctrl-C is the only way to shut down. The fix in this PR is to make the temp directory shutdown priority lower than SparkContext's, so that the temp dirs are the last thing to get deleted, after the SparkContext has been shut down. Also, the DiskBlockManager shutdown priority is changed from the default 100 to temp_dir_prio + 1, so that it gets invoked just before all temp dirs are cleared. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6482 from tdas/SPARK-7930 and squashes the following commits: d7cbeb5 [Tathagata Das] Removed unnecessary line 1514d0b [Tathagata Das] Fixed shutdown hook priorities
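A self-contained sketch of priority-ordered shutdown hooks (the real helper lives in Spark's utilities; names here are illustrative). Higher priority runs first, so giving temp-dir cleanup a priority below SparkContext's makes it run last:

```scala
import scala.collection.mutable

// Sketch: run registered hooks in descending priority order at JVM shutdown.
object ShutdownHooks {
  private val hooks = mutable.ArrayBuffer.empty[(Int, () => Unit)]
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit =
      hooks.synchronized(hooks.sortBy(-_._1)).foreach { case (_, h) => h() }
  })
  def add(priority: Int)(hook: => Unit): Unit =
    hooks.synchronized { hooks += ((priority, () => hook)) }
}

ShutdownHooks.add(50) { println("stop SparkContext") } // runs first
ShutdownHooks.add(25) { println("delete temp dirs") }  // runs last
```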
* [SPARK-7932] Fix misleading scheduler delay visualization (Kay Ousterhout, 2015-05-28, 1 file, -1/+1)
The existing code rounds down to the nearest percent when computing the proportion of a task's time that was spent on each phase of execution, and then computes the scheduler delay proportion as 100 - sum(all other proportions). As a result, a few extra percent can end up in the scheduler delay. This commit eliminates the rounding so that the time visualizations correspond properly to the real times. sarutak If you could take a look at this, that would be great! Not sure if there's a good reason to round here that I missed. cc shivaram Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #6484 from kayousterhout/SPARK-7932 and squashes the following commits: 1723cc4 [Kay Ousterhout] [SPARK-7932] Fix misleading scheduler delay visualization
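A worked example of the rounding artifact, with made-up numbers: flooring each phase's share to a whole percent dumps the lost fractions into the 100 - sum remainder that the page labels scheduler delay.

```scala
// A 1000 ms task with four 249 ms phases has only 4 ms (0.4%) of real delay.
val total  = 1000.0
val phases = Seq(249.0, 249.0, 249.0, 249.0)

// Old: round each proportion down to a whole percent, then back out the delay.
val floored  = phases.map(p => (p / total * 100).toInt) // 24, 24, 24, 24
val delayOld = 100 - floored.sum                        // 4 percent: inflated 10x

// Fixed: keep exact proportions so the visualization matches the real times.
val exact    = phases.map(_ / total * 100)              // 24.9 each
val delayNew = 100 - exact.sum                          // ~0.4 percent
```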
* [SPARK-7926] [PYSPARK] use the official Pyrolite release (Xiangrui Meng, 2015-05-28, 1 file, -1/+7)
Switch to the official Pyrolite release from the one published under `org.spark-project`. Thanks irmen for making the releases on Maven Central. We didn't upgrade to 4.6 because we don't have enough time for QA. I excluded `serpent` from its dependencies because we don't use it in Spark.
~~~
[info] +-net.jpountz.lz4:lz4:1.3.0
[info] +-net.razorvine:pyrolite:4.4
[info] +-net.sf.py4j:py4j:0.8.2.1
~~~
davies Author: Xiangrui Meng <meng@databricks.com> Closes #6472 from mengxr/SPARK-7926 and squashes the following commits: 7b3c6bf [Xiangrui Meng] use the official Pyrolite release
* [SPARK-7927] whitespace fixes for core (Reynold Xin, 2015-05-28, 88 files, -203/+205)
So we can enable a whitespace enforcement rule in the style checker to save code review time. Author: Reynold Xin <rxin@databricks.com> Closes #6473 from rxin/whitespace-core and squashes the following commits: 058195d [Reynold Xin] Fixed tests. fce11e9 [Reynold Xin] [SPARK-7927] whitespace fixes for core.
* [SPARK-7826] [CORE] Suppress extra calling getCacheLocs (Takuya UESHIN, 2015-05-28, 2 files, -8/+42)
There are too many extra calls to the method `getCacheLocs` in `DAGScheduler`, each of which involves Akka communication. To improve `DAGScheduler` performance, suppress the extra calls. In my application with over 1200 stages, execution time dropped from 8.5 min to 3.8 min with my patch. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #6352 from ueshin/issues/SPARK-7826 and squashes the following commits: 3d4d036 [Takuya UESHIN] Modify a test and the documentation. 10b1b22 [Takuya UESHIN] Simplify the unit test. d858b59 [Takuya UESHIN] Move the storageLevel check inside the if (!cacheLocs.contains(rdd.id)) block. 6f3125c [Takuya UESHIN] Fix scalastyle. b9c835c [Takuya UESHIN] Put the condition that checks if the RDD has uncached partition or not into variable for readability. f87f2ec [Takuya UESHIN] Get cached locations from block manager only if the storage level of the RDD is not StorageLevel.NONE. 8248386 [Takuya UESHIN] Revert "Suppress extra calling getCacheLocs." a4d944a [Takuya UESHIN] Add an unit test. 9a80fad [Takuya UESHIN] Suppress extra calling getCacheLocs.
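The caching idea, reduced to a sketch (field and type names hypothetical): consult the block manager only for RDDs whose locations are not already memoized, and never for `StorageLevel.NONE`.

```scala
import scala.collection.mutable

// Sketch: memoize cache locations per RDD id to avoid repeated Akka round trips.
class CacheLocs(blockManagerLookup: Int => Seq[String]) {
  private val cacheLocs = mutable.HashMap.empty[Int, Seq[String]]
  def getCacheLocs(rddId: Int, storageLevelIsNone: Boolean): Seq[String] =
    cacheLocs.getOrElseUpdate(rddId,
      if (storageLevelIsNone) Seq.empty  // uncached RDD: never ask
      else blockManagerLookup(rddId))    // one lookup, then memoized
}
```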
* Remove SizeEstimator from o.a.spark package (Reynold Xin, 2015-05-28, 2 files, -47/+17)
See comments on https://github.com/apache/spark/pull/3913 Author: Reynold Xin <rxin@databricks.com> Closes #6471 from rxin/sizeestimator and squashes the following commits: c057095 [Reynold Xin] Fixed import. 2da478b [Reynold Xin] Remove SizeEstimator from o.a.spark package.
* [SPARK-7782] fixed sort arrow issue (zuxqoj, 2015-05-27, 1 file, -3/+3)
Current behaviour:
In Spark UI: ![screen shot 2015-05-27 at 3 27 51 pm](https://cloud.githubusercontent.com/assets/3919211/7837541/47d330ba-04a5-11e5-89d1-e5b11da1a513.png)
In YARN: ![screen shot 2015-05-27 at 3](https://cloud.githubusercontent.com/assets/3919211/7837594/aebd1d36-04a5-11e5-8216-86e03c07d2bd.png)
In JIRA: ![screen shot 2015-05-27 at 3_2](https://cloud.githubusercontent.com/assets/3919211/7837616/d3fedce2-04a5-11e5-9e68-960ed54e5d83.png)
Author: zuxqoj <sbshekhar@gmail.com> Closes #6437 from zuxqoj/SPARK-7782_PR and squashes the following commits: cd068b9 [zuxqoj] [SPARK-7782] fixed sort arrow issue
* [SPARK-7896] Allow ChainedBuffer to store more than 2 GB (Sandy Ryza, 2015-05-27, 2 files, -42/+55)
Author: Sandy Ryza <sandy@cloudera.com> Closes #6440 from sryza/sandy-spark-7896 and squashes the following commits: 49d8a0d [Sandy Ryza] Fix bug introduced when reading over record boundaries 6006856 [Sandy Ryza] Fix overflow issues 006b4b2 [Sandy Ryza] Fix scalastyle by removing non ascii characters 8b000ca [Sandy Ryza] Add ascii art to describe layout of data in metaBuffer f2053c0 [Sandy Ryza] Fix negative overflow issue 0368c78 [Sandy Ryza] Initialize size as 0 a5a4820 [Sandy Ryza] Use explicit types for all numbers in ChainedBuffer b7e0213 [Sandy Ryza] SPARK-7896. Allow ChainedBuffer to store more than 2 GB
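The underlying technique can be sketched as follows (illustrative, not the patched class): address the buffer with a `Long` split into a (chunk index, offset) pair, so capacity is no longer limited by a single Int-indexed array's 2 GB ceiling.

```scala
import scala.collection.mutable

// Sketch: a byte buffer chained from fixed-size chunks, addressed by Long.
class ChainedBuf(chunkSizeBits: Int = 22) { // 4 MB chunks
  private val chunkSize = 1 << chunkSizeBits
  private val chunks = mutable.ArrayBuffer.empty[Array[Byte]]
  def write(pos: Long, b: Byte): Unit = {
    val chunk  = (pos >> chunkSizeBits).toInt  // which chunk
    val offset = (pos & (chunkSize - 1)).toInt // offset within it
    while (chunks.size <= chunk) chunks += new Array[Byte](chunkSize)
    chunks(chunk)(offset) = b
  }
  def read(pos: Long): Byte =
    chunks((pos >> chunkSizeBits).toInt)((pos & (chunkSize - 1)).toInt)
}
```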
* [SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time (Josh Rosen, 2015-05-27, 3 files, -24/+147)
This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disabled auto-reset. When auto-reset is disabled, then this breaks things in some of our shuffle paths which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe). This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer. We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one.

I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became. This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance. Internally, KryoSerializerInstance now implements a `borrowKryo()` / `releaseKryo()` API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits).

Author: Josh Rosen <joshrosen@databricks.com> Closes #6415 from JoshRosen/SPARK-7873 and squashes the following commits: 00b402e [Josh Rosen] Initialize eagerly to fix a failing test ba55d20 [Josh Rosen] Add explanatory comments 3f1da96 [Josh Rosen] Guard against duplicate close() ab457ca [Josh Rosen] Sketch a loan/release based solution. 9816e8f [Josh Rosen] Add a failing test showing how deserialize() and deserializeStream() can interfere. 7350886 [Josh Rosen] Add failing regression test for SPARK-7873
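The loan pattern described above, sketched generically (the real capacity-1 pool is internal to `KryoSerializerInstance`): borrow returns the cached instance or allocates a new one, and release discards extras.

```scala
// Sketch: a capacity-1 borrow/release pool for expensive-to-create instances.
class LoanPool[T <: AnyRef](newInstance: () => T) {
  private var cached: T = null.asInstanceOf[T]
  def borrow(): T = synchronized {
    if (cached != null) { val t = cached; cached = null.asInstanceOf[T]; t }
    else newInstance() // pool empty: allocate on demand
  }
  def release(t: T): Unit = synchronized {
    if (cached == null) cached = t // otherwise drop it: capacity is 1
  }
}
```

Each serializer method then brackets its work with borrow()/release(), which is what makes several open streams from one instance safe.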
* [SPARK-7864] [UI] Fix the logic grabbing the link from table in AllJobPage (Kousuke Saruta, 2015-05-27, 2 files, -2/+2)
This issue is related to #6419. AllJobPage doesn't currently have a "kill link", but I think we should fix the issue mentioned in #6419 anyway, just in case, to avoid accidents in the future. It's a minor issue for now, so I didn't file it in JIRA. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #6432 from sarutak/remove-ambiguity-of-link and squashes the following commits: cd1a503 [Kousuke Saruta] Fixed ambiguity link issue in AllJobPage
* [SPARK-7878] Rename Stage.jobId to firstJobId (Kay Ousterhout, 2015-05-27, 4 files, -37/+33)
The previous name was confusing, because each stage can be associated with many jobs, and jobId is just the ID of the first job that was associated with the Stage. This commit also renames some of the method parameters in DAGScheduler.scala to clarify when the jobId refers to the first job ID associated with the stage (as opposed to the jobId associated with a job that's currently being scheduled). cc markhamstra JoshRosen (hopefully this will help prevent future bugs like SPARK-6880) Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #6418 from kayousterhout/SPARK-7878 and squashes the following commits: b71a9b8 [Kay Ousterhout] [SPARK-7878] Rename Stage.jobId to firstJobId
* [CORE] [TEST] HistoryServerSuite failed due to timezone issue (scwf, 2015-05-27, 1 file, -1/+1)
Follow-up for #6377: change the time to the equivalent in GMT. /cc squito Author: scwf <wangfei1@huawei.com> Closes #6425 from scwf/fix-HistoryServerSuite and squashes the following commits: 4d37935 [scwf] fix HistoryServerSuite
* [SPARK-7864] [UI] Do not kill innocent stages from visualization (Andrew Or, 2015-05-26, 3 files, -3/+3)
**Reproduction.** Run a long-running job, go to the job page, expand the DAG visualization, and click into a stage. Your stage is now killed. Why? This is because the visualization code just reaches into the stage table and grabs the first link it finds. In our case, this first link happens to be the kill link instead of the one to the stage page.

**Fix.** Use proper CSS selectors to avoid ambiguity. This is an alternative to #6407. Thanks carsonwang for catching this. Author: Andrew Or <andrew@databricks.com> Closes #6419 from andrewor14/fix-ui-viz-kill and squashes the following commits: 25203bd [Andrew Or] Do not kill innocent stages
* [SPARK-6602] [CORE] Remove some places in core that calling SparkEnv.actorSystem (zsxwing, 2015-05-26, 3 files, -12/+16)
Author: zsxwing <zsxwing@gmail.com> Closes #6333 from zsxwing/remove-actor-system-usage and squashes the following commits: f125aa6 [zsxwing] Fix YarnAllocatorSuite ceadcf6 [zsxwing] Change the "port" parameter type of "AkkaUtils.address" to "int"; update ApplicationMaster and YarnAllocator to get the driverUrl from RpcEnv 3239380 [zsxwing] Remove some places in core that calling SparkEnv.actorSystem
* [SPARK-7854] [TEST] refine Kryo test suite (Zhang, Liye, 2015-05-26, 1 file, -24/+27)
This modification follows JoshRosen's comments; for details, please refer to [#5934](https://github.com/apache/spark/pull/5934/files#r30949751). Author: Zhang, Liye <liye.zhang@intel.com> Closes #6395 from liyezhang556520/kryoTest and squashes the following commits: da214c8 [Zhang, Liye] refine Kryo test suite according to Josh's comments
* [CORE] [TEST] Fix SimpleDateParamTest (scwf, 2015-05-26, 1 file, -3/+3)
```
sbt.ForkMain$ForkError: 1424424077190 was not equal to 1424474477190
    at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
    at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
    at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
    at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply$mcV$sp(SimpleDateParamTest.scala:25)
    at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
    at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
    at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    at org.scalatest.Suite$class.withFixture(Suite.scala:
```
Set the timezone to fix SimpleDateParamTest. Author: scwf <wangfei1@huawei.com> Author: Fei Wang <wangfei1@huawei.com> Closes #6377 from scwf/fix-SimpleDateParamTest and squashes the following commits: b8df1e5 [Fei Wang] Update SimpleDateParamSuite.scala 8bb74f0 [scwf] fix SimpleDateParamSuite
* Add test which shows Kryo buffer size configured in mb is properly supported (tedyu, 2015-05-25, 1 file, -0/+4)
This PR adds a test which shows that Kryo buffer size configured in mb is supported properly. Author: tedyu <yuzhihong@gmail.com> Closes #6390 from tedyu/master and squashes the following commits: c51ea64 [tedyu] Fix KryoSerializer creation f12ee04 [tedyu] Correct conf variable name in test 642de51 [tedyu] Drop change in KryoSerializer so that the new test runs d2fdbc4 [tedyu] Give bufferSizeKb initial value 9a17277 [tedyu] Rewrite bufferSize checking 4739998 [tedyu] Rewrite bufferSize checking 830d0d0 [tedyu] Kryo buffer size configured in mb should be properly supported
* [SPARK-7287] [HOTFIX] Disable o.a.s.deploy.SparkSubmitSuite --packages (Patrick Wendell, 2015-05-23, 1 file, -1/+2)
* [HOTFIX] Add tests for SparkListenerApplicationStart with Driver Logs (Hari Shreedharan, 2015-05-22, 1 file, -2/+23)
#6166 added the driver logs to `SparkListenerApplicationStart`. This adds tests in `JsonProtocolSuite` to ensure we don't regress. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6368 from harishreedharan/jsonprotocol-test and squashes the following commits: dc9eafc [Hari Shreedharan] [HOTFIX] Add tests for SparkListenerApplicationStart with Driver Logs.
* [SPARK-7795] [CORE] Speed up task scheduling in standalone mode by reusing serializer (Akshat Aranya, 2015-05-22, 1 file, -4/+8)
My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (>500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms. Author: Akshat Aranya <aaranya@quantcast.com> Closes #6323 from coolfrood/master and squashes the following commits: 12d8c9e [Akshat Aranya] Reduce visibility of serializer bd4a5dd [Akshat Aranya] Style fix 0b8ca93 [Akshat Aranya] Incorporate review comments fe530cd [Akshat Aranya] Speed up task scheduling in standalone mode by reusing serializer instead of creating a new one for each task.
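The optimization in sketch form (hypothetical interfaces, not the scheduler code): hoist the expensive `newInstance()` out of the per-task loop.

```scala
// Sketch: serialize a batch of tasks with one serializer instance
// instead of constructing a fresh one per task.
trait SerializerInstance { def serialize(o: AnyRef): Array[Byte] }
trait Serializer { def newInstance(): SerializerInstance }

def serializeTasks(serializer: Serializer, tasks: Seq[AnyRef]): Seq[Array[Byte]] = {
  val ser = serializer.newInstance() // once per batch, not once per task
  tasks.map(ser.serialize)
}
```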
* [SPARK-7224] [SPARK-7306] mock repository generator for --packages tests without nio.Path (Burak Yavuz, 2015-05-22, 5 files, -100/+404)
The previous PR for SPARK-7224 (#5790) broke JDK 6, because it used java.nio.Path, which is in JDK 7 but not in 6. This PR uses Guava's `Files` to handle directory creation, and etc... The description from the previous PR: > This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet, and Maven Central. cc pwendell I also ran the flaky test about 20 times locally; it didn't fail a single time, but I think it may fail like once every 100 builds? I still haven't figured out the cause yet, but the test before it, `--jars`, was also failing after we turned off the `--packages` test in `SparkSubmitSuite`. It may be related to the launch of SparkSubmit. Author: Burak Yavuz <brkyvz@gmail.com> Closes #5892 from brkyvz/maven-utils and squashes the following commits: e9b1903 [Burak Yavuz] fix merge conflict 68214e0 [Burak Yavuz] remove ignore for test(neglect spark dependencies) e632381 [Burak Yavuz] fix ignore 9ef1408 [Burak Yavuz] re-enable --packages test 22eea62 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into maven-utils 05cd0de [Burak Yavuz] added mock repository generator
* [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further (Andrew Or, 2015-05-22, 1 file, -8/+18)
The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seems rather high. This patch lowers both and rephrases a few log messages. Author: Andrew Or <andrew@databricks.com> Closes #6301 from andrewor14/da-minor and squashes the following commits: 6d614a6 [Andrew Or] Lower log level 2811492 [Andrew Or] Log information when requests are canceled 5fcd3eb [Andrew Or] Fix tests 3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
* [SPARK-7760] add /json back into master & worker pages; add test (Imran Rashid, 2015-05-22, 3 files, -3/+37)
Author: Imran Rashid <irashid@cloudera.com> Closes #6284 from squito/SPARK-7760 and squashes the following commits: 5e02d8a [Imran Rashid] style; increase timeout 9987399 [Imran Rashid] comment 8c7ed63 [Imran Rashid] add /json back into master & worker pages; add test
* [SPARK-7758] [SQL] Override more configs to avoid failure when connecting to a PostgreSQL metastore (WangTaoTheTonic, 2015-05-22, 1 file, -1/+1)
https://issues.apache.org/jira/browse/SPARK-7758 When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which talk about PostgreSQL, to connect to the temporary Derby metastore, thus causing an error. To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()` method][1], all properties whose name mentions "jdo" and "datanucleus" must be included. [1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288 I have tested using PostgreSQL as the metastore, and it worked fine. Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits: ca7ae7c [WangTaoTheTonic] add comments 86caf2c [WangTaoTheTonic] delete unused import e4f0feb [WangTaoTheTonic] block more data source related property 92a81fa [WangTaoTheTonic] fix style check e3e683d [WangTaoTheTonic] override more configs to avoid failure connecting to PostgreSQL
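The masking rule from the description, sketched with a hypothetical helper (the authoritative list comes from Hive's `ObjectStore.getDataSourceProps()`): any property whose name mentions jdo or datanucleus counts as metastore data-source configuration and must be overridden before the temporary Derby metastore starts.

```scala
// Sketch: identify every metastore data-source property, not just the URL.
def isMetastoreDataSourceProp(name: String): Boolean = {
  val n = name.toLowerCase
  n.contains("jdo") || n.contains("datanucleus")
}

// Both of these would leak PostgreSQL settings into the temp Derby metastore
// if left unmasked; the warehouse dir is fine to pass through.
Seq("javax.jdo.option.ConnectionURL",
    "datanucleus.autoCreateSchema",
    "hive.metastore.warehouse.dir").filter(isMetastoreDataSourceProp)
// => List(javax.jdo.option.ConnectionURL, datanucleus.autoCreateSchema)
```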
* [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled (Josh Rosen, 2015-05-22, 2 files, -0/+35)
SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization. This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits: e19726d [Josh Rosen] Add fix for SPARK-7766. 71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
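The shape of the fix, as a sketch (the wrapper function is hypothetical; `Kryo.reset()` is the real API call that clears reference-tracking state):

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Sketch: reset reference-tracking state before each top-level write, so a
// stream can never hold references into objects written to a previous one.
def writeTopLevel(kryo: Kryo, out: Output, obj: AnyRef): Unit = {
  kryo.reset() // harmless when auto-reset is on; essential when it is off
  kryo.writeClassAndObject(out, obj)
  out.flush()
}
```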
* [SPARK-7657] [YARN] Add driver logs links in application UI, in cluster mode (Hari Shreedharan, 2015-05-21, 5 files, -6/+30)
This PR adds the URLs to the driver logs to the `SparkListenerApplicationStarted` event, which is later used by the `ExecutorsListener` to populate the URLs to the driver logs in its own state. This info is then used when the UI is rendered to display links to the logs. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6166 from harishreedharan/am-log-link and squashes the following commits: 943fc4f [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into am-log-link 9e5c04b [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into am-log-link b3f9b9d [Hari Shreedharan] Updated comment based on feedback. 0840a95 [Hari Shreedharan] Move the result and sc.stop back to original location, minor import changes. 537a2f7 [Hari Shreedharan] Add test to ensure the log urls are populated and valid. 4033725 [Hari Shreedharan] Adding comments explaining how node reports are used to get the log urls. 6c5c285 [Hari Shreedharan] Import order. 346f4ea [Hari Shreedharan] Review feedback fixes. 629c1dc [Hari Shreedharan] Cleanup. 99fb1a3 [Hari Shreedharan] Send the log urls in App start event, to ensure that other listeners are not affected. c0de336 [Hari Shreedharan] Ensure new unit test cleans up after itself. 50cdae3 [Hari Shreedharan] Added unit test, made the approach generic. 402e8e4 [Hari Shreedharan] Use `NodeReport` to get the URL for the logs. Also, make the environment variables generic so other cluster managers can use them as well. 1cf338f [Hari Shreedharan] [SPARK-7657][YARN] Add driver link in application UI, in cluster mode.
* [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning (Andrew Or, 2015-05-21, 1 file, -0/+18)
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning. Author: Andrew Or <andrew@databricks.com> Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits: a82b451 [Andrew Or] Fix style 10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures 17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up 523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
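The idea in sketch form (hypothetical helper; the real change threads this decision through SparkContext's cleaning path): pay for the reflective closure walk only when the closure's provenance is unknown.

```scala
// Sketch: skip closure cleaning for closures built inside Spark itself,
// which are known to be serializable by construction.
def maybeClean[F <: AnyRef](f: F, knownSerializable: Boolean, clean: F => F): F =
  if (knownSerializable) f // trusted: skip the expensive reflective walk
  else clean(f)            // user-supplied: clean and check as usual
```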
* [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative (Sean Owen, 2015-05-21, 2 files, -6/+20)
Document the current limitation of rdd.fold. This does not resolve SPARK-6416 but just documents the issue. CC JoshRosen Author: Sean Owen <sowen@cloudera.com> Closes #6231 from srowen/SPARK-6416 and squashes the following commits: 9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed da40d84 [Sean Owen] Document current limitation of rdd.fold.
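A concrete illustration of the documented caveat (runnable against any SparkContext `sc`): `fold` applies the operator first within each partition and then across partition results, folding the zero value in each time, so a non-commutative operator gives partition-dependent answers.

```scala
val nums = sc.parallelize(1 to 4, numSlices = 2)

nums.fold(0)(_ + _) // 10 regardless of partitioning: + is commutative/associative
nums.fold(0)(_ - _) // NOT the -10 of a sequential fold; with two partitions:
                    // part1 = 0-1-2 = -3, part2 = 0-3-4 = -7, then e.g.
                    // 0-(-3)-(-7) = 10, and even that depends on completion order
```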
* [SPARK-7389] [CORE] Tachyon integration improvement (Mingfei, 2015-05-20, 4 files, -48/+149)
Two main changes. First, add two functions, `putValues` and `getValues`, to ExternalBlockManager, because an implementation may not rely on `putBytes` and `getBytes`. Second, improve the Tachyon integration: currently, when putting data into Tachyon, Spark first serializes all data in one partition into a ByteBuffer and then writes it into Tachyon, which uses a lot of memory and increases GC overhead; when getting data from Tachyon, `getValues` depends on `getBytes`, which also reads all data into an on-heap byte array, again resulting in high memory usage. This PR changes the approach of the two functions, making them read/write data by stream to reduce memory usage. In our testing, when the data size is huge, this patch reduces GC time by about 30% and full GC time by about 70%, and total execution time by about 10%. Author: Mingfei <mingfei.shi@intel.com> Closes #5908 from shimingfei/Tachyon-integration-rebase and squashes the following commits: 033bc57 [Mingfei] modify according to comments 747c69a [Mingfei] modify according to comments - format changes ce52c67 [Mingfei] put close() in a finally block d2c60bb [Mingfei] modify according to comments, some code style change 4c11591 [Mingfei] modify according to comments: split putIntoExternalBlockStore into two functions, add default implementation for getValues and putValues cc0a32e [Mingfei] Make getValues read data from Tachyon by stream; make putValues write data to Tachyon by stream 017593d [Mingfei] add getValues and putValues in ExternalBlockManager's Interface
* [SPARK-7750] [WEBUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too (Hari Shreedharan, 2015-05-20, 6 files, -20/+23)
Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6273 from harishreedharan/json-to-api and squashes the following commits: e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` 42f8acb [Hari Shreedharan] Import order fixes. 2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.