spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[SPARK-13178] RRDD faces with concurrency issue in case of rdd.zip(rdd).count().	Sun Rui	2016-04-22	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792. This PR just removes a workaround not needed anymore. ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Closes #12606 from sun-rui/SPARK-13178.
*	[SPARK-14841][SQL] Move SQLBuilder into sql/core	Reynold Xin	2016-04-22	8	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core. ## How was this patch tested? Also moved unit tests. Author: Reynold Xin <rxin@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #12602 from rxin/SPARK-14841.
*	[SPARK-14843][ML] Fix encoding error in LibSVMRelation	Liang-Chi Hsieh	2016-04-23	2	-5/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We use `RowEncoder` in libsvm data source to serialize the label and features read from libsvm files. However, the schema passed in this encoder is not correct. As the result, we can't correctly select `features` column from the DataFrame. We should use full data schema instead of `requiredSchema` to serialize the data read in. Then do projection to select required columns later. ## How was this patch tested? `LibSVMRelationSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12611 from viirya/fix-libsvm.
*	[SPARK-10001] Consolidate Signaling and SignalLogger.	Reynold Xin	2016-04-22	5	-77/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is a follow-up to #12557, with the following changes: 1. Fixes some of the style issues. 2. Merges Signaling and SignalLogger into a new class called SignalUtils. It was pretty confusing to have Signaling and Signal in one file, and it was also confusing to have two classes named Signaling and one called the other. 3. Made logging registration idempotent. ## How was this patch tested? N/A. Author: Reynold Xin <rxin@databricks.com> Closes #12605 from rxin/SPARK-10001.
*	[SPARK-13266] [SQL] None read/writer options were not transalated to "null"	Liang-Chi Hsieh	2016-04-22	3	-4/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem. This is based on #11305 from mathieulongtin. ## How was this patch tested? Added test to readwriter.py. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: mathieu longtin <mathieu.longtin@nuance.com> Closes #12494 from viirya/py-df-none-option.
*	[SPARK-14848][SQL] Compare as Set in DatasetSuite - Java encoder	Pete Robbins	2016-04-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Change test to compare sets rather than sequence ## How was this patch tested? Full test runs on little endian and big endian platforms Author: Pete Robbins <robbinspg@gmail.com> Closes #12610 from robbinspg/DatasetSuiteFix.
*	[MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifier	Zheng RuiFeng	2016-04-22	2	-40/+40
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 1, fix the indentation 2, add a missing param desc ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12499 from zhengruifeng/fix_doc.
*	[SPARK-6429] Implement hashCode and equals together	Joan	2016-04-22	32	-40/+136
\| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Implement some `hashCode` and `equals` together in order to enable the scalastyle. This is a first batch, I will continue to implement them but I wanted to know your thoughts. Author: Joan <joan@goyeau.com> Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
*	[SPARK-14609][SQL] Native support for LOAD DATA DDL command	Liang-Chi Hsieh	2016-04-22	11	-8/+427
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add the native support for LOAD DATA DDL command that loads data into Hive table/partition. ## How was this patch tested? `HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use `LOAD DATA` command. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12412 from viirya/ddl-load-data.
*	[SPARK-14826][SQL] Remove HiveQueryExecution	Reynold Xin	2016-04-22	20	-436/+420
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12588 from rxin/SPARK-14826.
*	[SPARK-10001] [CORE] Interrupt tasks in repl with Ctrl+C	Jakob Odersky	2016-04-21	5	-28/+147
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Improve signal handling to allow interrupting running tasks from the REPL (with Ctrl+C). If no tasks are running or Ctrl+C is pressed twice, the signal is forwarded to the default handler resulting in the usual termination of the application. This PR is a rewrite of -- and therefore closes #8216 -- as per piaozhexiu's request ## How was this patch tested? Signal handling is not easily testable therefore no unit tests were added. Nevertheless, the new functionality is implemented in a best-effort approach, soft-failing in case signals aren't available on a specific OS. Author: Jakob Odersky <jakob@odersky.com> Closes #12557 from jodersky/SPARK-10001-sigint.
*	[SPARK-14835][SQL] Remove MetastoreRelation dependency from SQLBuilder	Reynold Xin	2016-04-21	2	-8/+22
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes SQLBuilder's dependency on MetastoreRelation. We should be able to move SQLBuilder into the sql/core package after this change. ## How was this patch tested? N/A - covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12594 from rxin/SPARK-14835.
*	[SPARK-14369] [SQL] Locality support for FileScanRDD	Cheng Lian	2016-04-21	6	-37/+291
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(This PR is a rebased version of PR #12153.) ## What changes were proposed in this pull request? This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts: 1. Block location lookup Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es. Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems. 2. Selecting preferred locations For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs. ## How was this patch tested? Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations. Author: Cheng Lian <lian@databricks.com> Closes #12527 from liancheng/spark-14369-locality-rebased.
*	[SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in ↵	Sameer Agarwal	2016-04-21	5	-39/+322
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TungstenAggregate ## What changes were proposed in this pull request? This PR adds support for all primitive datatypes, decimal types and stringtypes in the VectorizedHashmap during aggregation. ## How was this patch tested? Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below). Author: Sameer Agarwal <sameer@databricks.com> Closes #12440 from sameeragarwal/all-datatypes.
*	[SPARK-14793] [SQL] Code generation for large complex type exceeds JVM size ↵	Takuya UESHIN	2016-04-21	3	-53/+144
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	limit. ## What changes were proposed in this pull request? Code generation for complex type, `CreateArray`, `CreateMap`, `CreateStruct`, `CreateNamedStruct`, exceeds JVM size limit for large elements. We should split generated code into multiple `apply` functions if the complex types have large elements, like `UnsafeProjection` or others for large expressions. ## How was this patch tested? I added some tests to check if the generated codes for the expressions exceed or not. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #12559 from ueshin/issues/SPARK-14793.
*	[SPARK-14824][SQL] Rename HiveContext object to HiveUtils	Andrew Or	2016-04-21	20	-55/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Just a rename so we can get rid of `HiveContext.scala`. Note that this will conflict with #12585. ## How was this patch tested? No change in functionality. Author: Andrew Or <andrew@databricks.com> Closes #12586 from andrewor14/rename-hc-object.
*	[HOTFIX] Fix Java 7 compilation break	Reynold Xin	2016-04-21	4	-11/+6
\|
*	[SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove ↵	Reynold Xin	2016-04-21	13	-226/+199
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveSqlAstBuilder ## What changes were proposed in this pull request? This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder. In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait. ## How was this patch tested? Updated unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12584 from rxin/SPARK-14821.
*	[SPARK-14479][ML] GLM supports output link prediction	Yanbo Liang	2016-04-21	2	-34/+108
\| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? GLM supports output link prediction. ## How was this patch tested? unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12287 from yanboliang/spark-14479.
*	[SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib ↵	Joseph K. Bradley	2016-04-21	5	-2/+139
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Vector, Matrix types ## What changes were proposed in this pull request? For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another. This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types. ## How was this patch tested? Unit tests for all conversions Author: Joseph K. Bradley <joseph@databricks.com> Closes #12504 from jkbradley/linalg-conversions.
*	[SPARK-14724] Use radix sort for shuffles and sort operator when possible	Eric Liang	2016-04-21	24	-119/+876
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible. The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR. ## How was this patch tested? Unit tests, enabling radix sort on existing tests. Microbenchmark results: ``` Running benchmark: radix sort 25000000 Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic Intel(R) Core(TM) i7-4600U CPU 2.10GHz radix sort 25000000: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- reference TimSort key prefix array 15546 / 15859 1.6 621.9 1.0X reference Arrays.sort 2416 / 2446 10.3 96.6 6.4X radix sort one byte 133 / 137 188.4 5.3 117.2X radix sort two bytes 255 / 258 98.2 10.2 61.1X radix sort eight bytes 991 / 997 25.2 39.6 15.7X radix sort key prefix array 1540 / 1563 16.2 61.6 10.1X ``` I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort. ``` TPCDS on master: 2499s real time, 8185s executor - 1171s in TimSort, avg 267 MB/s (note the /s accounting is weird here since dataSize counts the record sizes too) TPCDS with radix enabled: 2294s real time, 7391s executor - 596s in TimSort, avg 254 MB/s - 26s in radix sort, avg 4.2 GB/s ``` cc davies rxin Author: Eric Liang <ekl@databricks.com> Closes #12490 from ericl/sort-benchmark.
*	[SPARK-14569][ML] Log instrumentation in KMeans	Xin Ren	2016-04-21	3	-6/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14569 Log instrumentation in KMeans: - featuresCol - predictionCol - k - initMode - initSteps - maxIter - seed - tol - summary ## How was this patch tested? Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample Author: Xin Ren <iamshrek@126.com> Closes #12432 from keypointt/SPARK-14569.
*	[SPARK-14780] [R] Add `setLogLevel` to SparkR	Dongjoon Hyun	2016-04-21	3	-0/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR aims to add `setLogLevel` function to SparkR shell. Spark Shell ```scala scala> sc.setLogLevel("ERROR") ``` PySpark ```python >>> sc.setLogLevel("ERROR") ``` SparkR (this PR) ```r > setLogLevel(sc, "ERROR") NULL ``` ## How was this patch tested? Pass the Jenkins tests including a new R testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12547 from dongjoon-hyun/SPARK-14780.
*	[SPARK-14774][SQL] Write unscaled values in ColumnVector.putDecimal	Sameer Agarwal	2016-04-21	3	-30/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values. ## How was this patch tested? This codepath is hit only when the vectorized aggregate hashmap is enabled. https://github.com/apache/spark/pull/12440 makes sure that a number of regression tests/benchmarks test this bugfix. Author: Sameer Agarwal <sameer@databricks.com> Closes #12541 from sameeragarwal/fix-bigdecimal.
*	[SPARK-14798][SQL] Move native command and script transformation parsing ↵	Reynold Xin	2016-04-21	15	-182/+192
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	into SparkSqlAstBuilder ## What changes were proposed in this pull request? This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for diff. ## How was this patch tested? Updated test cases to reflect this. Author: Reynold Xin <rxin@databricks.com> Closes #12564 from rxin/SPARK-14798.
*	[MINOR] Comment whitespace changes in #12553	Andrew Or	2016-04-21	1	-9/+10
\|
*	[SPARK-13643][SQL] Implement SparkSession	Andrew Or	2016-04-21	6	-197/+964
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? After removing most of `HiveContext` in 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #12553 from andrewor14/implement-spark-session.
*	[SPARK-14801][SQL] Move MetastoreRelation to its own file	Reynold Xin	2016-04-21	2	-205/+232
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This class is currently in HiveMetastoreCatalog.scala, which is a large file that makes refactoring and searching of usage difficult. Moving it out so I can then do SPARK-14799 and make the review of that simpler. ## How was this patch tested? N/A - this is a straightforward move and should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12567 from rxin/SPARK-14801.
*	[SPARK-14699][CORE] Stop endpoints before closing the connections and don't ↵	Shixiong Zhu	2016-04-21	3	-8/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	stop client in Outbox ## What changes were proposed in this pull request? In general, `onDisconnected` is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected so RpcEnv should not fire these events. This PR moves `dispatcher.stop()` above closing the connections so that when stopping RpcEnv, the endpoints won't receive `onDisconnected` events. In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well. ## How was this patch tested? test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events") Author: Shixiong Zhu <shixiong@databricks.com> Closes #12481 from zsxwing/SPARK-14699.
*	[SPARK-14795][SQL] Remove the use of Hive's variable substitution	Reynold Xin	2016-04-21	3	-11/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch builds on #12556 and completely removes the use of Hive's variable substitution. ## How was this patch tested? Covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12561 from rxin/SPARK-14795.
*	[SPARK-14799][SQL] Remove MetastoreRelation dependency from AnalyzeTable - ↵	Reynold Xin	2016-04-21	1	-26/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	part 1 ## What changes were proposed in this pull request? This patch isolates AnalyzeTable's dependency on MetastoreRelation into a single line. After this we can work on converging MetastoreRelation and CatalogTable. ## How was this patch tested? Covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12566 from rxin/SPARK-14799.
*	[SPARK-14783] Preserve full exception stacktrace in IsolatedClientLoader	Josh Rosen	2016-04-21	1	-1/+1
\| \| \| \| \| \| \| \|	In IsolatedClientLoader, we have a`catch` block which throws an exception without wrapping the original exception, causing the full exception stacktrace and any nested exceptions to be lost. This patch fixes this, improving the usefulness of classloading error messages. Author: Josh Rosen <joshrosen@databricks.com> Closes #12548 from JoshRosen/improve-logging-for-hive-classloader-issues.
*	[SPARK-4452] [CORE] Shuffle data structures can starve others on the same ↵	Lianhui Wang	2016-04-21	8	-46/+324
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	thread for memory ## What changes were proposed in this pull request? In #9241 It implemented a mechanism to call spill() on those SQL operators that support spilling if there is not enough memory for execution. But ExternalSorter and AppendOnlyMap in Spark core are not worked. So this PR make them benefit from #9241. Now when there is not enough memory for execution, it can get memory by spilling ExternalSorter and AppendOnlyMap in Spark core. ## How was this patch tested? add two unit tests for it. Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #10024 from lianhuiwang/SPARK-4452-2.
*	[SPARK-14797][BUILD] Spark SQL POM should not hardcode spark-sketch_2.11 dep.	Josh Rosen	2016-04-21	2	-1/+51
\| \| \| \| \| \| \| \| \| \| \| \|	Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs. This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa). /cc ahirreddy, who spotted this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes #12563 from JoshRosen/fix-sketch-scala-version.
*	[SPARK-13988][CORE] Make replaying event logs multi threaded in Histo…ry ↵	Parth Brahmbhatt	2016-04-21	2	-43/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	server to ensure a single large log does not block other logs from being rendered. ## What changes were proposed in this pull request? The patch makes event log processing multi threaded. ## How was this patch tested? Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI. Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
*	[HOTFIX] Remove wrong DDL tests	Liang-Chi Hsieh	2016-04-21	1	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As we moved most parsing rules to `SparkSqlParser`, some tests expected to throw exception are not correct anymore. ## How was this patch tested? `DDLCommandSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12572 from viirya/hotfix-ddl.
*	[SPARK-14779][CORE] Corrected log message in Worker case KillExecutor	Bryan Cutler	2016-04-21	1	-1/+1
\| \| \| \| \| \| \| \|	In o.a.s.deploy.worker.Worker.scala, when receiving a KillExecutor message from an invalid Master, fixed typo by changing the log message to read "..attemped to kill executor.." Author: Bryan Cutler <cutlerb@gmail.com> Closes #12546 from BryanCutler/worker-killexecutor-log-message.
*	[SPARK-14787][SQL] Upgrade Joda-Time library from 2.9 to 2.9.3	hyukjinkwon	2016-04-21	6	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14787 The possible problems are described in the JIRA above. Please refer this if you are wondering the purpose of this PR. This PR upgrades Joda-Time library from 2.9 to 2.9.3. ## How was this patch tested? `sbt scalastyle` and Jenkins tests in this PR. closes #11847 Author: hyukjinkwon <gurwls223@gmail.com> Closes #12552 from HyukjinKwon/SPARK-14787.
*	[SPARK-14739][PYSPARK] Fix Vectors parser bugs	Arash Parsa	2016-04-21	2	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The PySpark deserialization has a bug that shows while deserializing all zero sparse vectors. This fix filters out empty string tokens before casting, hence properly stringified SparseVectors successfully get parsed. ## How was this patch tested? Standard unit-tests similar to other methods. Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal> Author: Arash Parsa <arashpa@gmail.com> Author: Vishnu Prasad <vishnu667@gmail.com> Author: Vishnu Prasad S <vishnu667@gmail.com> Closes #12516 from arashpa/SPARK-14739.
*	[SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws ↵	Sean Owen	2016-04-21	11	-14/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	non-declared InterruptedException ## What changes were proposed in this pull request? `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #12418 from srowen/SPARK-8393.
*	[SPARK-14753][CORE] remove internal flag in Accumulable	Wenchen Fan	2016-04-21	14	-122/+98
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases: 1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered. 2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered. For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last. For 2, we can un-register these accumulators immediately. TODO: remove `internal` flag in `AccumulableInfo` with followup PR ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12525 from cloud-fan/acc.
*	[SPARK-14794][SQL] Don't pass analyze command into Hive	Reynold Xin	2016-04-21	2	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We shouldn't pass analyze command to Hive because some of those would require running MapReduce jobs. For now, let's just always run the no scan analyze. ## How was this patch tested? Updated test case to reflect this change. Author: Reynold Xin <rxin@databricks.com> Closes #12558 from rxin/parser-analyze.
*	[HOTFIX] Disable flaky tests	Reynold Xin	2016-04-21	1	-2/+2
\|
*	[SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser	Reynold Xin	2016-04-21	14	-489/+568
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR. ## How was this patch tested? No test change. This simply moves code around. Author: Reynold Xin <rxin@databricks.com> Closes #12556 from rxin/SPARK-14792.
*	[SPARK-14786] Remove hive-cli dependency from hive subproject	Josh Rosen	2016-04-20	3	-7/+33
\| \| \| \| \| \| \| \| \| \| \| \|	The `hive` subproject currently depends on `hive-cli` in order to perform a check to see whether a `SessionState` is an instance of `org.apache.hadoop.hive.cli.CliSessionState` (see #9589). The introduction of this `hive-cli` dependency has caused problems for users whose Hive metastore JAR classpaths don't include the `hive-cli` classes (such as in #11495). This patch removes this dependency on `hive-cli` and replaces the `isInstanceOf` check by reflection. I added a Maven Enforcer rule to ban `hive-cli` from the `hive` subproject in order to make sure that this dependency is not accidentally reintroduced. /cc rxin yhuai adrian-wang preecet Author: Josh Rosen <joshrosen@databricks.com> Closes #12551 from JoshRosen/remove-hive-cli-dep-from-hive-subproject.
*	[SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from ↵	Reynold Xin	2016-04-20	4	-40/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveSqlAstBuilder ## What changes were proposed in this pull request? The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser. This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12550 from rxin/SPARK-14782.
*	[HOTFIX] Ignore all Docker integration tests	Josh Rosen	2016-04-20	3	-0/+9
\| \| \| \| \| \| \| \|	The Docker integration tests are failing very often (https://spark-tests.appspot.com/failed-tests) so I think we should disable these suites for now until we have time to improve them. Author: Josh Rosen <joshrosen@databricks.com> Closes #12549 from JoshRosen/ignore-all-docker-tests.
*	[SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths	Reynold Xin	2016-04-20	4	-22/+17
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove those complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive specific variable substitution. ## How was this patch tested? This is a small test refactoring to simplify test infrastructure. Author: Reynold Xin <rxin@databricks.com> Closes #12543 from rxin/SPARK-14775.
*	[SPARK-14602][YARN] Use SparkConf to propagate the list of cached files.	Marcelo Vanzin	2016-04-20	11	-175/+239
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change avoids using the environment to pass this information, since with many jars it's easy to hit limits on certain OSes. Instead, it encodes the information into the Spark configuration propagated to the AM. The first problem that needed to be solved is a chicken & egg issue: the config file is distributed using the cache, and it needs to contain information about the files that are being distributed. To solve that, the code now treats the config archive especially, and uses slightly different code to distribute it, so that only its cache path needs to be saved to the config file. The second problem is that the extra information would show up in the Web UI, which made the environment tab even more noisy than it already is when lots of jars are listed. This is solved by two changes: the list of cached files is now read only once in the AM, and propagated down to the ExecutorRunnable code (which actually sends the list to the NMs when starting containers). The second change is to unset those config entries after the list is read, so that the SparkContext never sees them. Tested with both client and cluster mode by running "run-example SparkPi". This uploads a whole lot of files when run from a build dir (instead of a distribution, where the list is cleaned up), and I verified that the configs do not show up in the UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12487 from vanzin/SPARK-14602.
*	[SPARK-14769][SQL] Create built-in functionality for variable substitution	Reynold Xin	2016-04-20	3	-0/+215
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage. Note that this pull request does not yet use this functionality anywhere. ## How was this patch tested? Added VariableSubstitutionSuite for unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12538 from rxin/SPARK-14769.