* [SPARK-14609][SQL] Native support for LOAD DATA DDL command
  Liang-Chi Hsieh, 2016-04-22 (11 files, -8/+427)

  ## What changes were proposed in this pull request?
  Add native support for the LOAD DATA DDL command, which loads data into a Hive table/partition.

  ## How was this patch tested?
  `HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, a few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use the `LOAD DATA` command.

  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #12412 from viirya/ddl-load-data.
* [SPARK-14826][SQL] Remove HiveQueryExecution
  Reynold Xin, 2016-04-22 (20 files, -436/+420)

  ## What changes were proposed in this pull request?
  This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.

  ## How was this patch tested?
  Should be covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12588 from rxin/SPARK-14826.
* [SPARK-10001] [CORE] Interrupt tasks in repl with Ctrl+C
  Jakob Odersky, 2016-04-21 (5 files, -28/+147)

  ## What changes were proposed in this pull request?
  Improve signal handling to allow interrupting running tasks from the REPL (with Ctrl+C). If no tasks are running or Ctrl+C is pressed twice, the signal is forwarded to the default handler resulting in the usual termination of the application. This PR is a rewrite of -- and therefore closes #8216 -- as per piaozhexiu's request.

  ## How was this patch tested?
  Signal handling is not easily testable, therefore no unit tests were added. Nevertheless, the new functionality is implemented in a best-effort approach, soft-failing in case signals aren't available on a specific OS.

  Author: Jakob Odersky <jakob@odersky.com>
  Closes #12557 from jodersky/SPARK-10001-sigint.
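  The patch itself is not reproduced here. The following is only a minimal sketch of the signal-handling pattern described above, assuming `sun.misc.Signal` is available on the platform and using a hypothetical `cancelAllJobs` hook in place of whatever the REPL actually calls:

  ```scala
  import sun.misc.{Signal, SignalHandler}

  // Sketch: cancel running jobs on the first Ctrl+C; if nothing was cancelled,
  // restore the previous handler and re-raise so the application terminates.
  object InterruptHandlerSketch {
    def install(cancelAllJobs: () => Boolean): Unit = {
      val sigint = new Signal("INT")
      var previous: SignalHandler = null
      previous = Signal.handle(sigint, new SignalHandler {
        override def handle(sig: Signal): Unit = {
          if (!cancelAllJobs()) {
            Signal.handle(sigint, previous)  // fall back to the default behaviour
            Signal.raise(sigint)
          }
        }
      })
    }
  }
  ```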
* [SPARK-14835][SQL] Remove MetastoreRelation dependency from SQLBuilder
  Reynold Xin, 2016-04-21 (2 files, -8/+22)

  ## What changes were proposed in this pull request?
  This patch removes SQLBuilder's dependency on MetastoreRelation. We should be able to move SQLBuilder into the sql/core package after this change.

  ## How was this patch tested?
  N/A - covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12594 from rxin/SPARK-14835.
* [SPARK-14369] [SQL] Locality support for FileScanRDD
  Cheng Lian, 2016-04-21 (6 files, -37/+291)

  (This PR is a rebased version of PR #12153.)

  ## What changes were proposed in this pull request?
  This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:

  1. Block location lookup. Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog`, and converts all `FileStatus`es into `LocatedFileStatus`es. Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either, so there's no need to special case these file systems.

  2. Selecting preferred locations. For each `FilePartition`, we pick the top 3 locations containing the most data to be retrieved. This isn't necessarily the best algorithm out there; further improvements may be brought up in follow-up PRs.

  ## How was this patch tested?
  Tested by overriding the default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.

  Author: Cheng Lian <lian@databricks.com>
  Closes #12527 from liancheng/spark-14369-locality-rebased.
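  As a rough illustration of step 2 above (not the PR's code), picking the three hosts that hold the most bytes of a partition could look like the sketch below, with `BlockSlice` standing in for the block metadata gathered in step 1:

  ```scala
  // Each block of a partition is replicated on some hosts and has a length in bytes.
  case class BlockSlice(hosts: Seq[String], length: Long)

  // Return up to three hosts holding the most data for this partition.
  def preferredLocations(blocks: Seq[BlockSlice]): Seq[String] = {
    blocks
      .flatMap(b => b.hosts.map(host => host -> b.length))
      .groupBy { case (host, _) => host }
      .mapValues(_.map { case (_, len) => len }.sum)
      .toSeq
      .sortBy { case (_, bytes) => -bytes }
      .take(3)
      .map { case (host, _) => host }
  }
  ```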
* [SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in TungstenAggregate
  Sameer Agarwal, 2016-04-21 (5 files, -39/+322)

  ## What changes were proposed in this pull request?
  This PR adds support for all primitive datatypes, decimal types and string types in the VectorizedHashmap during aggregation.

  ## How was this patch tested?
  Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #12440 from sameeragarwal/all-datatypes.
* [SPARK-14793] [SQL] Code generation for large complex type exceeds JVM size limit.
  Takuya UESHIN, 2016-04-21 (3 files, -53/+144)

  ## What changes were proposed in this pull request?
  Code generation for complex types, `CreateArray`, `CreateMap`, `CreateStruct`, `CreateNamedStruct`, exceeds the JVM size limit for large elements. We should split generated code into multiple `apply` functions if the complex types have large elements, like `UnsafeProjection` or others for large expressions.

  ## How was this patch tested?
  I added some tests to check whether the generated code for these expressions exceeds the limit or not.

  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #12559 from ueshin/issues/SPARK-14793.
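  The generated-code changes are not shown here. As a loose sketch of the splitting idea only (the names and the emitted Java shape are illustrative, not Spark's codegen framework), per-element snippets can be grouped into several helper functions that the top-level method calls in order, so no single generated method grows past the JVM's 64KB-per-method bytecode limit:

  ```scala
  // Group code snippets into chunks and emit one helper per chunk, plus a
  // top-level apply() that calls the helpers in sequence.
  def splitIntoFunctions(snippets: Seq[String], maxPerFunction: Int): String = {
    val helpers = snippets.grouped(maxPerFunction).zipWithIndex.map {
      case (chunk, i) =>
        s"private void apply_$i(InternalRow row) {\n  " +
          chunk.mkString("\n  ") + "\n}"
    }.toSeq
    val calls = helpers.indices.map(i => s"  apply_$i(row);").mkString("\n")
    (helpers :+ s"public void apply(InternalRow row) {\n$calls\n}").mkString("\n\n")
  }
  ```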
* [SPARK-14824][SQL] Rename HiveContext object to HiveUtils
  Andrew Or, 2016-04-21 (20 files, -55/+55)

  ## What changes were proposed in this pull request?
  Just a rename so we can get rid of `HiveContext.scala`. Note that this will conflict with #12585.

  ## How was this patch tested?
  No change in functionality.

  Author: Andrew Or <andrew@databricks.com>
  Closes #12586 from andrewor14/rename-hc-object.
* [HOTFIX] Fix Java 7 compilation break
  Reynold Xin, 2016-04-21 (4 files, -11/+6)
* [SPARK-14821][SQL] Implement AnalyzeTable in sql/core and remove HiveSqlAstBuilder
  Reynold Xin, 2016-04-21 (13 files, -226/+199)

  ## What changes were proposed in this pull request?
  This patch moves analyze table parsing into SparkSqlAstBuilder and removes HiveSqlAstBuilder. In order to avoid extensive refactoring, I created a common trait for CatalogRelation and MetastoreRelation, and match on that. In the future we should probably just consolidate the two into a single thing so we don't need this common trait.

  ## How was this patch tested?
  Updated unit tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12584 from rxin/SPARK-14821.
* [SPARK-14479][ML] GLM supports output link prediction
  Yanbo Liang, 2016-04-21 (2 files, -34/+108)

  ## What changes were proposed in this pull request?
  GLM supports output link prediction.

  ## How was this patch tested?
  Unit test.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12287 from yanboliang/spark-14479.
* [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib Vector, Matrix types
  Joseph K. Bradley, 2016-04-21 (5 files, -2/+139)

  ## What changes were proposed in this pull request?
  For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have `private[spark]` methods for converting from one linear algebra representation to another. This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types.

  ## How was this patch tested?
  Unit tests for all conversions.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12504 from jkbradley/linalg-conversions.
* [SPARK-14724] Use radix sort for shuffles and sort operator when possible
  Eric Liang, 2016-04-21 (24 files, -119/+876)

  ## What changes were proposed in this pull request?
  Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible. The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.

  ## How was this patch tested?
  Unit tests, enabling radix sort on existing tests. Microbenchmark results:
  ```
  Running benchmark: radix sort 25000000
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
  Intel(R) Core(TM) i7-4600U CPU  2.10GHz

  radix sort 25000000:                Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
  ---------------------------------------------------------------------------------------
  reference TimSort key prefix array      15546 / 15859        1.6        621.9      1.0X
  reference Arrays.sort                     2416 / 2446        10.3         96.6      6.4X
  radix sort one byte                        133 / 137        188.4          5.3    117.2X
  radix sort two bytes                       255 / 258         98.2         10.2     61.1X
  radix sort eight bytes                     991 / 997         25.2         39.6     15.7X
  radix sort key prefix array               1540 / 1563        16.2         61.6     10.1X
  ```
  I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.
  ```
  TPCDS on master: 2499s real time, 8185s executor
   - 1171s in TimSort, avg 267 MB/s
   (note the /s accounting is weird here since dataSize counts the record sizes too)

  TPCDS with radix enabled: 2294s real time, 7391s executor
   - 596s in TimSort, avg 254 MB/s
   - 26s in radix sort, avg 4.2 GB/s
  ```
  cc davies rxin

  Author: Eric Liang <ekl@databricks.com>
  Closes #12490 from ericl/sort-benchmark.
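  For readers unfamiliar with the algorithm, here is a small, self-contained least-significant-byte radix sort over long keys (treated as unsigned). It only illustrates the technique; it is not the unsafe-sort implementation the PR actually adds.

  ```scala
  // LSD radix sort: eight counting-sort passes, one byte per pass.
  def radixSortUnsigned(input: Array[Long]): Array[Long] = {
    var src = input.clone()
    var dst = new Array[Long](input.length)
    for (pass <- 0 until 8) {
      val shift = pass * 8
      val counts = new Array[Int](257)
      src.foreach(v => counts(((v >>> shift) & 0xffL).toInt + 1) += 1)
      for (i <- 1 until 257) counts(i) += counts(i - 1)  // prefix sums = bucket offsets
      src.foreach { v =>
        val bucket = ((v >>> shift) & 0xffL).toInt
        dst(counts(bucket)) = v
        counts(bucket) += 1
      }
      val tmp = src; src = dst; dst = tmp
    }
    src
  }
  ```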
* [SPARK-14569][ML] Log instrumentation in KMeans
  Xin Ren, 2016-04-21 (3 files, -6/+23)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-14569
  Log instrumentation in KMeans:
  - featuresCol
  - predictionCol
  - k
  - initMode
  - initSteps
  - maxIter
  - seed
  - tol
  - summary

  ## How was this patch tested?
  Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample.

  Author: Xin Ren <iamshrek@126.com>
  Closes #12432 from keypointt/SPARK-14569.
* [SPARK-14780] [R] Add `setLogLevel` to SparkR
  Dongjoon Hyun, 2016-04-21 (3 files, -0/+25)

  ## What changes were proposed in this pull request?
  This PR aims to add the `setLogLevel` function to the SparkR shell.

  **Spark Shell**
  ```scala
  scala> sc.setLogLevel("ERROR")
  ```
  **PySpark**
  ```python
  >>> sc.setLogLevel("ERROR")
  ```
  **SparkR (this PR)**
  ```r
  > setLogLevel(sc, "ERROR")
  NULL
  ```

  ## How was this patch tested?
  Pass the Jenkins tests including a new R testcase.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12547 from dongjoon-hyun/SPARK-14780.
* [SPARK-14774][SQL] Write unscaled values in ColumnVector.putDecimal
  Sameer Agarwal, 2016-04-21 (3 files, -30/+37)

  ## What changes were proposed in this pull request?
  We recently made `ColumnarBatch.row` mutable and added a new `ColumnVector.putDecimal` method to support putting `Decimal` values in the `ColumnarBatch`. This unfortunately introduced a bug wherein we were not updating the vector with the proper unscaled values.

  ## How was this patch tested?
  This codepath is hit only when the vectorized aggregate hashmap is enabled. https://github.com/apache/spark/pull/12440 makes sure that a number of regression tests/benchmarks test this bugfix.

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #12541 from sameeragarwal/fix-bigdecimal.
* [SPARK-14798][SQL] Move native command and script transformation parsing into SparkSqlAstBuilder
  Reynold Xin, 2016-04-21 (15 files, -182/+192)

  ## What changes were proposed in this pull request?
  This patch moves native command and script transformation into SparkSqlAstBuilder. This builds on #12561. See the last commit for the diff.

  ## How was this patch tested?
  Updated test cases to reflect this.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12564 from rxin/SPARK-14798.
* [MINOR] Comment whitespace changes in #12553
  Andrew Or, 2016-04-21 (1 file, -9/+10)
* [SPARK-13643][SQL] Implement SparkSession
  Andrew Or, 2016-04-21 (6 files, -197/+964)

  ## What changes were proposed in this pull request?
  After removing most of `HiveContext` in 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it.

  ## How was this patch tested?
  Jenkins.

  Author: Andrew Or <andrew@databricks.com>
  Closes #12553 from andrewor14/implement-spark-session.
* [SPARK-14801][SQL] Move MetastoreRelation to its own file
  Reynold Xin, 2016-04-21 (2 files, -205/+232)

  ## What changes were proposed in this pull request?
  This class is currently in HiveMetastoreCatalog.scala, which is a large file that makes refactoring and searching of usage difficult. Moving it out so I can then do SPARK-14799 and make the review of that simpler.

  ## How was this patch tested?
  N/A - this is a straightforward move and should be covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12567 from rxin/SPARK-14801.
* [SPARK-14699][CORE] Stop endpoints before closing the connections and don't stop client in Outbox
  Shixiong Zhu, 2016-04-21 (3 files, -8/+31)

  ## What changes were proposed in this pull request?
  In general, `onDisconnected` is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected so RpcEnv should not fire these events. This PR moves `dispatcher.stop()` above closing the connections so that when stopping RpcEnv, the endpoints won't receive `onDisconnected` events. In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well.

  ## How was this patch tested?
  test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #12481 from zsxwing/SPARK-14699.
* [SPARK-14795][SQL] Remove the use of Hive's variable substitution
  Reynold Xin, 2016-04-21 (3 files, -11/+8)

  ## What changes were proposed in this pull request?
  This patch builds on #12556 and completely removes the use of Hive's variable substitution.

  ## How was this patch tested?
  Covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12561 from rxin/SPARK-14795.
* [SPARK-14799][SQL] Remove MetastoreRelation dependency from AnalyzeTable - part 1
  Reynold Xin, 2016-04-21 (1 file, -26/+23)

  ## What changes were proposed in this pull request?
  This patch isolates AnalyzeTable's dependency on MetastoreRelation into a single line. After this we can work on converging MetastoreRelation and CatalogTable.

  ## How was this patch tested?
  Covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12566 from rxin/SPARK-14799.
* [SPARK-14783] Preserve full exception stacktrace in IsolatedClientLoader
  Josh Rosen, 2016-04-21 (1 file, -1/+1)

  In IsolatedClientLoader, we have a `catch` block which throws an exception without wrapping the original exception, causing the full exception stacktrace and any nested exceptions to be lost. This patch fixes this, improving the usefulness of classloading error messages.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12548 from JoshRosen/improve-logging-for-hive-classloader-issues.
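  The one-line fix itself is not shown above. The general pattern (with illustrative names, not the PR's code) is to pass the original exception as the cause when rethrowing, so nothing in the stack trace chain is lost:

  ```scala
  // Keep the original exception as the cause instead of rethrowing only its message.
  def withWrappedClassLoadingErrors[T](body: => T): T = {
    try body
    catch {
      case e: ReflectiveOperationException =>
        // new RuntimeException(msg) alone would drop e's stack trace and nested causes;
        // passing `e` as the second argument preserves the full chain.
        throw new RuntimeException("Failed to load isolated Hive client classes", e)
    }
  }
  ```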
* [SPARK-4452] [CORE] Shuffle data structures can starve others on the same thread for memory
  Lianhui Wang, 2016-04-21 (8 files, -46/+324)

  ## What changes were proposed in this pull request?
  #9241 implemented a mechanism to call spill() on those SQL operators that support spilling when there is not enough memory for execution, but ExternalSorter and AppendOnlyMap in Spark core were not covered. This PR makes them benefit from #9241: now, when there is not enough memory for execution, Spark can reclaim memory by spilling ExternalSorter and AppendOnlyMap in Spark core.

  ## How was this patch tested?
  Added two unit tests for it.

  Author: Lianhui Wang <lianhuiwang09@gmail.com>
  Closes #10024 from lianhuiwang/SPARK-4452-2.
* [SPARK-14797][BUILD] Spark SQL POM should not hardcode spark-sketch_2.11 dep.
  Josh Rosen, 2016-04-21 (2 files, -1/+51)

  Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs. This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa).

  /cc ahirreddy, who spotted this issue.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12563 from JoshRosen/fix-sketch-scala-version.
* [SPARK-13988][CORE] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered.
  Parth Brahmbhatt, 2016-04-21 (2 files, -43/+56)

  ## What changes were proposed in this pull request?
  The patch makes event log processing multi threaded.

  ## How was this patch tested?
  Existing tests pass; no new tests are needed since this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch, the UI does not render any app for almost 30 seconds, then big2 and small1 appear; after another 30-second delay, big1 finally shows up in the UI. With this change, small1 shows up immediately and big1 and big2 come up within 30 seconds. Locally it also displays them in the correct order in the UI.

  Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
  Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
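  A minimal sketch of the approach, assuming a hypothetical `replay` function and an arbitrary pool size (the history server's real configuration and APIs are not shown):

  ```scala
  import java.util.concurrent.Executors
  import scala.concurrent.duration.Duration
  import scala.concurrent.{Await, ExecutionContext, Future}

  // Replay each event log on a small fixed pool so one huge log cannot block
  // the others from being processed and rendered.
  def replayAll(logPaths: Seq[String], replay: String => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val futures = logPaths.map(path => Future(replay(path)))
    futures.foreach(f => Await.ready(f, Duration.Inf))
    pool.shutdown()
  }
  ```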
* [HOTFIX] Remove wrong DDL tests
  Liang-Chi Hsieh, 2016-04-21 (1 file, -13/+0)

  ## What changes were proposed in this pull request?
  As we moved most parsing rules to `SparkSqlParser`, some tests that expect an exception to be thrown are no longer correct.

  ## How was this patch tested?
  `DDLCommandSuite`

  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #12572 from viirya/hotfix-ddl.
* [SPARK-14779][CORE] Corrected log message in Worker case KillExecutor
  Bryan Cutler, 2016-04-21 (1 file, -1/+1)

  In o.a.s.deploy.worker.Worker.scala, when receiving a KillExecutor message from an invalid Master, fixed a typo by changing the log message to read "..attempted to kill executor..".

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #12546 from BryanCutler/worker-killexecutor-log-message.
* [SPARK-14787][SQL] Upgrade Joda-Time library from 2.9 to 2.9.3
  hyukjinkwon, 2016-04-21 (6 files, -6/+6)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-14787
  The possible problems are described in the JIRA above; please refer to it if you are wondering about the purpose of this PR. This PR upgrades the Joda-Time library from 2.9 to 2.9.3.

  ## How was this patch tested?
  `sbt scalastyle` and Jenkins tests in this PR.

  closes #11847

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12552 from HyukjinKwon/SPARK-14787.
* [SPARK-14739][PYSPARK] Fix Vectors parser bugs
  Arash Parsa, 2016-04-21 (2 files, -8/+14)

  ## What changes were proposed in this pull request?
  PySpark's deserialization has a bug that shows up when deserializing all-zero sparse vectors. This fix filters out empty string tokens before casting, so properly stringified SparseVectors are parsed successfully.

  ## How was this patch tested?
  Standard unit tests similar to other methods.

  Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal>
  Author: Arash Parsa <arashpa@gmail.com>
  Author: Vishnu Prasad <vishnu667@gmail.com>
  Author: Vishnu Prasad S <vishnu667@gmail.com>
  Closes #12516 from arashpa/SPARK-14739.
* [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
  Sean Owen, 2016-04-21 (11 files, -14/+16)

  ## What changes were proposed in this pull request?
  `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12418 from srowen/SPARK-8393.
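  The change is essentially an annotation. A hedged sketch of the pattern (with an illustrative class, not the actual JavaStreamingContext source):

  ```scala
  class TerminationAwaiter {
    // @throws makes the checked exception part of the generated Java signature,
    // so Java callers can (and must) handle InterruptedException explicitly.
    @throws[InterruptedException]
    def awaitTermination(): Unit = {
      // block until terminated; interruption propagates to the caller
    }
  }
  ```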
* [SPARK-14753][CORE] remove internal flag in Accumulable
  Wenchen Fan, 2016-04-21 (14 files, -122/+98)

  ## What changes were proposed in this pull request?
  The `Accumulable.internal` flag is only used to avoid registering internal accumulators in 2 cases:
  1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
  2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post an event; accumulators inside it should not be registered.
  For 1, we can create a `TempShuffleReadMetrics` that doesn't create accumulators, and just keep the data and merge it at the end. For 2, we can un-register these accumulators immediately.
  TODO: remove the `internal` flag in `AccumulableInfo` with a follow-up PR.

  ## How was this patch tested?
  Existing tests.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12525 from cloud-fan/acc.
* [SPARK-14794][SQL] Don't pass analyze command into Hive
  Reynold Xin, 2016-04-21 (2 files, -6/+8)

  ## What changes were proposed in this pull request?
  We shouldn't pass the analyze command to Hive because some of those would require running MapReduce jobs. For now, let's just always run the no scan analyze.

  ## How was this patch tested?
  Updated test case to reflect this change.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12558 from rxin/parser-analyze.
* [HOTFIX] Disable flaky tests
  Reynold Xin, 2016-04-21 (1 file, -2/+2)
* [SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser
  Reynold Xin, 2016-04-21 (14 files, -489/+568)

  ## What changes were proposed in this pull request?
  This patch moves as many parsing rules as possible into the SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.

  ## How was this patch tested?
  No test change. This simply moves code around.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12556 from rxin/SPARK-14792.
* [SPARK-14786] Remove hive-cli dependency from hive subproject
  Josh Rosen, 2016-04-20 (3 files, -7/+33)

  The `hive` subproject currently depends on `hive-cli` in order to perform a check to see whether a `SessionState` is an instance of `org.apache.hadoop.hive.cli.CliSessionState` (see #9589). The introduction of this `hive-cli` dependency has caused problems for users whose Hive metastore JAR classpaths don't include the `hive-cli` classes (such as in #11495). This patch removes this dependency on `hive-cli` and replaces the `isInstanceOf` check by reflection. I added a Maven Enforcer rule to ban `hive-cli` from the `hive` subproject in order to make sure that this dependency is not accidentally reintroduced.

  /cc rxin yhuai adrian-wang preecet

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12551 from JoshRosen/remove-hive-cli-dep-from-hive-subproject.
* [SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from HiveSqlAstBuilder
  Reynold Xin, 2016-04-20 (4 files, -40/+27)

  ## What changes were proposed in this pull request?
  The patch removes the HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive-specific dependencies in HiveSqlParser. This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor.

  ## How was this patch tested?
  This should be covered by existing tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12550 from rxin/SPARK-14782.
* [HOTFIX] Ignore all Docker integration tests
  Josh Rosen, 2016-04-20 (3 files, -0/+9)

  The Docker integration tests are failing very often (https://spark-tests.appspot.com/failed-tests) so I think we should disable these suites for now until we have time to improve them.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12549 from JoshRosen/ignore-all-docker-tests.
* [SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths
  Reynold Xin, 2016-04-20 (4 files, -22/+17)

  ## What changes were proposed in this pull request?
  The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove that complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive-specific variable substitution.

  ## How was this patch tested?
  This is a small test refactoring to simplify test infrastructure.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12543 from rxin/SPARK-14775.
* [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files.
  Marcelo Vanzin, 2016-04-20 (11 files, -175/+239)

  This change avoids using the environment to pass this information, since with many jars it's easy to hit limits on certain OSes. Instead, it encodes the information into the Spark configuration propagated to the AM.

  The first problem that needed to be solved is a chicken & egg issue: the config file is distributed using the cache, and it needs to contain information about the files that are being distributed. To solve that, the code now treats the config archive especially, and uses slightly different code to distribute it, so that only its cache path needs to be saved to the config file.

  The second problem is that the extra information would show up in the Web UI, which made the environment tab even more noisy than it already is when lots of jars are listed. This is solved by two changes: the list of cached files is now read only once in the AM, and propagated down to the ExecutorRunnable code (which actually sends the list to the NMs when starting containers). The second change is to unset those config entries after the list is read, so that the SparkContext never sees them.

  Tested with both client and cluster mode by running "run-example SparkPi". This uploads a whole lot of files when run from a build dir (instead of a distribution, where the list is cleaned up), and I verified that the configs do not show up in the UI.

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #12487 from vanzin/SPARK-14602.
* [SPARK-14769][SQL] Create built-in functionality for variable substitution
  Reynold Xin, 2016-04-20 (3 files, -0/+215)

  ## What changes were proposed in this pull request?
  In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage. Note that this pull request does not yet use this functionality anywhere.

  ## How was this patch tested?
  Added VariableSubstitutionSuite for unit tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12538 from rxin/SPARK-14769.
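  The Spark implementation is not reproduced here. A simplified sketch of `${var}`-style substitution over a plain map (the real code resolves against Spark/SQL/system configuration and guards against deep recursion) might look like:

  ```scala
  import scala.util.matching.Regex

  // Replace ${name} occurrences with values from `vars`, leaving unknown names as-is.
  def substitute(input: String, vars: Map[String, String]): String = {
    val pattern = """\$\{([^}]+)\}""".r
    pattern.replaceAllIn(input, m =>
      Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
  }

  // substitute("SELECT * FROM logs WHERE day = '${day}'", Map("day" -> "2016-04-20"))
  //   returns "SELECT * FROM logs WHERE day = '2016-04-20'"
  ```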
* [SPARK-14770][SQL] Remove unused queries in hive module test resources
  Reynold Xin, 2016-04-20 (690 files, -5352/+0)

  ## What changes were proposed in this pull request?
  We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest.

  ## How was this patch tested?
  N/A - removing unused test resources.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12540 from rxin/SPARK-14770.
* [SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually
  Subhobrata Dey, 2016-04-20 (1 file, -1/+4)

  ## What changes were proposed in this pull request?
  Three test cases, namely
  ```
  "count is partially aggregated"
  "count distinct is partially aggregated"
  "mixed aggregates are partially aggregated"
  ```
  were failing when PlannerSuite was run individually. This PR provides a fix for this.

  ## How was this patch tested?
  Unit tests.

  Author: Subhobrata Dey <sbcd90@gmail.com>
  Closes #12532 from sbcd90/plannersuitetestsfix.
* [SPARK-13842] [PYSPARK] pyspark.sql.types.StructType accessor enhancements
  Sheamus K. Parkes, 2016-04-20 (2 files, -9/+58)

  ## What changes were proposed in this pull request?
  Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
  - Iterating a `StructType` will iterate its fields
    - `[field.name for field in my_structtype]`
  - Indexing with a string will return a field by name
    - `my_structtype['my_field_name']`
  - Indexing with an integer will return a field by position
    - `my_structtype[0]`
  - Indexing with a slice will return a new `StructType` with just the chosen fields:
    - `my_structtype[1:3]`
  - The length is the number of fields (should also provide "truthiness" for free)
    - `len(my_structtype) == 2`

  ## How was this patch tested?
  Extended the unit test coverage in the accompanying `tests.py`.

  Author: Sheamus K. Parkes <shea.parkes@milliman.com>
  Closes #12251 from skparkes/pyspark-structtype-enhance.
* [SPARK-14678][SQL] Add a file sink log to support versioning and compaction
  Shixiong Zhu, 2016-04-20 (6 files, -27/+616)

  ## What changes were proposed in this pull request?
  This PR adds a special log for FileStreamSink, for two purposes:
  - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
  - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.

  FileStreamSinkLog has a new log format instead of the Java serialization format. It writes one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON representation of a FileLog. FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compaction, it reads all history logs and merges them with the new batch. During the compaction, it also deletes the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (it drops the deleted files).

  ## How was this patch tested?
  FileStreamSinkLogSuite

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #12435 from zsxwing/sink-log.
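  As a hedged illustration of the format described above (a version marker on the first line, then one JSON record per line); the exact version string, file naming and JSON schema here are assumptions, not the real format:

  ```scala
  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}

  // Write one log file per batch: a version line followed by JSON lines.
  def writeBatchLog(logDir: String, batchId: Long, jsonRecords: Seq[String]): Unit = {
    val content = ("v1" +: jsonRecords).mkString("\n")
    Files.write(Paths.get(logDir, batchId.toString),
      content.getBytes(StandardCharsets.UTF_8))
  }

  // Read it back: the first line is the version, the rest are JSON records.
  def readBatchLog(logDir: String, batchId: Long): (String, Seq[String]) = {
    val lines = new String(
      Files.readAllBytes(Paths.get(logDir, batchId.toString)),
      StandardCharsets.UTF_8).split("\n").toSeq
    (lines.head, lines.tail)
  }
  ```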
* [MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter
  Yanbo Liang, 2016-04-20 (2 files, -4/+5)

  ## What changes were proposed in this pull request?
  #11663 added type conversion functionality for parameters in PySpark. This PR finds the omissive `Param`s that did not pass the corresponding `TypeConverter` argument and fixes them. After this PR, all params in pyspark/ml/ use `TypeConverter`.

  ## How was this patch tested?
  Existing tests.

  cc jkbradley sethah

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12529 from yanboliang/typeConverter.
* [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
  Andrew Or, 2016-04-20 (43 files, -547/+797)

  ## What changes were proposed in this pull request?
  This PR has two main changes.
  1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
  2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

  ## How was this patch tested?
  Existing tests. This PR is trying to fix the test failures of https://github.com/apache/spark/pull/12485.

  Author: Andrew Or <andrew@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #12522 from yhuai/spark-session.
* [SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory
  Tathagata Das, 2016-04-20 (2 files, -1/+26)

  ## What changes were proposed in this pull request?
  Consider the directory structure dir/col=X/some-files. If we create a text format streaming dataframe on `dir/col=X/`, then it should not treat `col` as a partitioning column. Even though the streaming dataframe does not do so, the generated batch dataframes pick up `col` as a partitioning column, causing a mismatch between the streaming source schema and the generated df schema. This leads to a runtime failure:
  ```
  18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
  java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
  ```
  The reason is that the partition-inferring code has no idea of a base path, above which it should not search for partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined.

  ## How was this patch tested?
  New unit test.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #12517 from tdas/SPARK-14741.
* [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
  Joseph K. Bradley, 2016-04-20 (2 files, -0/+10)

  ## What changes were proposed in this pull request?
  Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does. This PR documents this fact.

  ## How was this patch tested?
  Doc only.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12519 from jkbradley/scaler-variance-doc.