path: root/sql
Commit message (author, date, files changed, lines removed/added)
* [SPARK-16104] [SQL] Do not create CSV writer object for every flush when writing (hyukjinkwon, 2016-06-21, 2 files, -11/+10)
  ## What changes were proposed in this pull request?
  This PR makes the `CsvWriter` object reusable instead of being created each time, following the approach already taken in the JSON data source. The original `CsvWriter` was created for each row, which was improved in https://github.com/apache/spark/pull/13229; however, `LineCsvWriter` still creates a `CsvWriter` object for each `flush()`. There is no need to close the object and re-create it for every flush. The original logic is kept as is, but `CsvWriter` is reused by resetting `CharArrayWriter`.
  ## How was this patch tested?
  Existing tests should cover this.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13809 from HyukjinKwon/write-perf.

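  A minimal sketch of the reuse pattern described above. The class and method names are illustrative only (the real code wraps univocity's CSV writer), but the key idea is the same: create the writer once and reset the `CharArrayWriter` on each flush instead of rebuilding everything.

  ```scala
  import java.io.{CharArrayWriter, PrintWriter}

  // Illustrative stand-in for the writer that is now created once and reused.
  class ReusableLineWriter {
    private val buffer = new CharArrayWriter()
    private val writer = new PrintWriter(buffer) // created once, not per flush()

    def writeRow(fields: Seq[String]): Unit = writer.println(fields.mkString(","))

    // Drain what has been written so far and reset the buffer for reuse.
    def flushContent(): String = {
      writer.flush()
      val content = buffer.toString
      buffer.reset()
      content
    }
  }
  ```
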
* [SPARK-16002][SQL] Sleep when no new data arrives to avoid 100% CPU usage (Shixiong Zhu, 2016-06-21, 6 files, -4/+27)
  ## What changes were proposed in this pull request?
  Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
  ## How was this patch tested?
  Existing unit tests.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13718 from zsxwing/SPARK-16002.

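  A rough sketch of the idea. The function names and the 10 ms constant mirror the description above; this is not the actual stream-execution code.

  ```scala
  // If a batch found no new data, back off briefly instead of spinning at 100% CPU.
  val pollingDelayMs = 10L // configurable minimum delay when no data arrives

  def runBatches(dataAvailable: () => Boolean, runBatch: () => Unit, isActive: () => Boolean): Unit = {
    while (isActive()) {
      if (dataAvailable()) {
        runBatch()
      } else {
        Thread.sleep(pollingDelayMs) // avoid a busy-wait while the source is idle
      }
    }
  }
  ```
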
* [SPARK-16037][SQL] Follow-up: add DataFrameWriter.insertInto() test cases for by position resolution (Cheng Lian, 2016-06-21, 1 file, -0/+48)
  ## What changes were proposed in this pull request?
  This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements. Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.
  ## How was this patch tested?
  N/A
  Author: Cheng Lian <lian@databricks.com>
  Closes #13810 from liancheng/spark-16037-follow-up-tests.

* [SPARK-16084][SQL] Minor comments update for "DESCRIBE" table (bomeng, 2016-06-21, 1 file, -3/+3)
  ## What changes were proposed in this pull request?
  1. FORMATTED is actually supported, but partition is not supported;
  2. Remove parenthesis as it is not necessary just like anywhere else.
  ## How was this patch tested?
  Minor issue. I do not think it needs a test case!
  Author: bomeng <bmeng@us.ibm.com>
  Closes #13791 from bomeng/SPARK-16084.

* [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD (hyukjinkwon, 2016-06-20, 1 file, -2/+32)
  ## What changes were proposed in this pull request?
  This PR makes the `input_file_name()` function return the file paths, not empty strings, for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The code below, with these external data sources:
  ```scala
  df.select(input_file_name).show()
  ```
  will produce
  - **Before**
  ```
  +-----------------+
  |input_file_name()|
  +-----------------+
  |                 |
  +-----------------+
  ```
  - **After**
  ```
  +--------------------+
  |   input_file_name()|
  +--------------------+
  |file:/private/var...|
  +--------------------+
  ```
  ## How was this patch tested?
  Unit tests in `ColumnExpressionSuite`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13759 from HyukjinKwon/SPARK-16044.

* [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source (gatorsmile, 2016-06-20, 2 files, -15/+98)
  #### What changes were proposed in this pull request?
  This PR is to fix the following bugs:

  **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 4,
    upperBound = 0,
    numPartitions = 3,
    connectionProperties = new Properties)
  ```
  **Before code changes:** The returned results are wrong and the generated partitions are wrong:
  ```
  Part 0 id < 3 or id is null
  Part 1 id >= 3 AND id < 2
  Part 2 id >= 2
  ```
  **After code changes:** Issue an `IllegalArgumentException` exception:
  ```
  Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
  ```

  **Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 1,
    upperBound = 5,
    numPartitions = 10,
    connectionProperties = new Properties)
  ```
  **Before code changes:** Returned correct results but the generated partitions are very inefficient, like:
  ```
  Partition 0: id < 1 or id is null
  Partition 1: id >= 1 AND id < 1
  Partition 2: id >= 1 AND id < 1
  Partition 3: id >= 1 AND id < 1
  Partition 4: id >= 1 AND id < 1
  Partition 5: id >= 1 AND id < 1
  Partition 6: id >= 1 AND id < 1
  Partition 7: id >= 1 AND id < 1
  Partition 8: id >= 1 AND id < 1
  Partition 9: id >= 1
  ```
  **After code changes:** Adjust `numPartitions` and can return the correct answers:
  ```
  Partition 0: id < 2 or id is null
  Partition 1: id >= 2 AND id < 3
  Partition 2: id >= 3 AND id < 4
  Partition 3: id >= 4
  ```

  **Issue 3: java.lang.ArithmeticException when numPartitions is zero**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 0,
    upperBound = 4,
    numPartitions = 0,
    connectionProperties = new Properties)
  ```
  **Before code changes:** Got the following exception:
  ```
  java.lang.ArithmeticException: / by zero
  ```
  **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero.
  #### How was this patch tested?
  Added test cases to verify the results
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13773 from gatorsmile/jdbcPartitioning.

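  The partition-predicate generation behind these fixes can be reconstructed from the examples above. The sketch below is illustrative only, not the actual JDBCRelation code: it validates the bounds (Issue 1), clamps the partition count to the key range (Issue 2), and falls back to a single full-table partition when `numPartitions <= 0` (Issue 3).

  ```scala
  // Returns one WHERE clause per partition of the given column range.
  def columnPartitions(column: String, lowerBound: Long, upperBound: Long,
      requestedPartitions: Int): Seq[String] = {
    require(lowerBound <= upperBound,
      s"the lower bound ($lowerBound) must not be larger than the upper bound ($upperBound)")
    // Issue 3: no column partitioning at all; a single partition scans the whole table.
    if (requestedPartitions <= 0) return Seq("1=1")
    // Issue 2: never create more partitions than there are strides in the key range.
    val numPartitions = math.min(requestedPartitions.toLong, upperBound - lowerBound).toInt
    if (numPartitions <= 1) return Seq("1=1")
    val stride = (upperBound - lowerBound) / numPartitions
    (0 until numPartitions).map { i =>
      val lower = if (i == 0) None else Some(s"$column >= ${lowerBound + i * stride}")
      val upper = if (i == numPartitions - 1) None else Some(s"$column < ${lowerBound + (i + 1) * stride}")
      (lower, upper) match {
        case (Some(l), Some(u)) => s"$l AND $u"
        case (Some(l), None)    => l
        case (None, Some(u))    => s"$u or $column is null"
        case _                  => "1=1"
      }
    }
  }
  ```

  With `lowerBound = 1, upperBound = 5, requestedPartitions = 10`, this yields the four partitions shown in the "after" output for Issue 2.
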
* [SPARK-13792][SQL] Limit logging of bad records in CSV data source (Reynold Xin, 2016-06-20, 4 files, -15/+40)
  ## What changes were proposed in this pull request?
  This pull request adds a new option (maxMalformedLogPerPartition) in the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records. The error log looks something like
  ```
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
  ```
  Closes #12173
  ## How was this patch tested?
  Manually tested.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13795 from rxin/SPARK-13792.

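  A hypothetical usage of the new option alongside DROPMALFORMED mode; the path and threshold are examples only.

  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-example").getOrCreate()

  val df = spark.read
    .option("mode", "DROPMALFORMED")              // drop (and log) malformed lines
    .option("maxMalformedLogPerPartition", "10")  // cap the warnings per partition
    .csv("/path/to/data.csv")
  ```
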
* [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" (Kousuke Saruta, 2016-06-20, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
  ## How was this patch tested?
  Existing unit tests.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #13777 from sarutak/SPARK-16061.

* [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc (Tathagata Das, 2016-06-20, 3 files, -56/+420)
  ## What changes were proposed in this pull request?
  Issues with current reader behavior:
  - `text()` without args returns an empty DF with no columns -> inconsistent; it's expected that text will always return a DF with a `value` string field.
  - `textFile()` without args fails with an exception because of the above reason; it expected the DF returned by `text()` to have a `value` field.
  - `orc()` does not have var args, inconsistent with others.
  - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
  - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)

  The solution I am implementing is to do the following.
  - For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, text, this means adding json(string), etc. For orc, this means adding orc(varargs).
  - Remove the special handling of text(), csv(), etc. that returns an empty dataframe with no fields. Rather, pass on the empty sequence of paths to the datasource, and let each datasource handle it right. E.g., the text data source should return an empty DF with schema (value: string).
  - Deduped docs and fixed their formatting.
  ## How was this patch tested?
  Added new unit tests for Scala and Java tests
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #13727 from tdas/SPARK-15982.

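  A usage sketch of the harmonized reader methods described above, assuming a SparkSession named `spark`; the paths are examples.

  ```scala
  // Single-argument and vararg forms now exist for each format.
  val text  = spark.read.text("/data/a.txt")                    // DataFrame with a single `value` column
  val json  = spark.read.json("/data/a.json", "/data/b.json")   // vararg form
  val orc   = spark.read.orc("/data/x.orc", "/data/y.orc")      // orc() gains a vararg overload
  val lines = spark.read.textFile("/data/a.txt")                // Dataset[String]
  ```
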
* [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite (Shixiong Zhu, 2016-06-20, 1 file, -99/+0)
  ## What changes were proposed in this pull request?
  ConsoleSinkSuite just collects content from stdout and compares it with the expected string. However, because Spark may not stop some background threads at once, there is a race condition where other threads are outputting logs to **stdout** while ConsoleSinkSuite is running, which makes ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in the future, we should refactor ConsoleSink to make it testable instead of depending on stdout. As it stands, this test is useless, so I just deleted it.
  ## How was this patch tested?
  Just removed a flaky test.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13776 from zsxwing/SPARK-16050.

* [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables (Yin Huai, 2016-06-20, 9 files, -25/+436)
  ## What changes were proposed in this pull request?
  This PR adds the static partition support to INSERT statement when the target table is a data source table.
  ## How was this patch tested?
  New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
  **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13769 from yhuai/SPARK-16030-1.

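  An example of the statement shape this enables, assuming a SparkSession named `spark`; the table and column names are made up.

  ```scala
  spark.sql("""
    INSERT INTO TABLE events
    PARTITION (ds = '2016-06-20', country)   -- ds is static, country remains dynamic
    SELECT id, payload, country FROM staging_events
  """)
  ```
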
* [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement (Yin Huai, 2016-06-19, 10 files, -61/+98)
  ## What changes were proposed in this pull request?
  This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.
  ## How was this patch tested?
  Existing tests.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13766 from yhuai/caseSensitivity.

* [SPARK-16031] Add debug-only socket source in Structured Streaming (Matei Zaharia, 2016-06-19, 9 files, -0/+293)
  ## What changes were proposed in this pull request?
  This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
  ## How was this patch tested?
  Unit tests and manual tests in spark-shell.
  Author: Matei Zaharia <matei@databricks.com>
  Closes #13748 from mateiz/socket-source.

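  Typical debug-only usage of the socket source, assuming a SparkSession named `spark`; the host and port are examples.

  ```scala
  // Read lines from a socket as a streaming DataFrame (debugging and tutorials only).
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", "9999")
    .load()

  // Echo whatever arrives to the console.
  val query = lines.writeStream.format("console").start()
  ```
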
* [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time (Davies Liu, 2016-06-19, 4 files, -4/+129)
  ## What changes were proposed in this pull request?
  Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (depending on whether Daylight Saving Time (DST) is in effect). This PR changes the conversion to use a best-effort approximation of the posix timestamp to look up the offset. Around a DST change, some local times are not defined (for example, 2016-03-13 02:00:00 PST) or map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice.
  ## How was this patch tested?
  Added regression tests.
  Author: Davies Liu <davies@databricks.com>
  Closes #13652 from davies/fix_timezone.

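  The following sketch illustrates the problem and the "best effort" idea described above; it is not the actual DateTimeUtils code, just a minimal rendition under that assumption.

  ```scala
  import java.util.TimeZone

  // Naive conversion: looks up the offset with local millis, which is exactly
  // what goes wrong around DST transitions.
  def naiveDaysToMillis(days: Int, tz: TimeZone): Long = {
    val localMillis = days.toLong * 86400000L    // millis since 1970-01-01 in local time
    localMillis - tz.getOffset(localMillis)      // but getOffset() expects a unix timestamp
  }

  // Best-effort approximation: use the naive result as an approximate unix
  // timestamp and look the offset up again from it.
  def approxDaysToMillis(days: Int, tz: TimeZone): Long = {
    val localMillis = days.toLong * 86400000L
    val guess = localMillis - tz.getOffset(localMillis)
    localMillis - tz.getOffset(guess)
  }
  ```
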
* [SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable (Sean Zhong, 2016-06-18, 3 files, -22/+50)
  ## What changes were proposed in this pull request?
  `DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
  ## How was this patch tested?
  Unit test.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13749 from clockfly/SPARK-16034.

* [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems (Wenchen Fan, 2016-06-18, 12 files, -185/+104)
  ## What changes were proposed in this pull request?
  The current table insertion has some weird behaviours:
  1. inserting into a partitioned table with mismatched columns produces a confusing error message for Hive tables, and a wrong result for data source tables;
  2. inserting into a partitioned table without a partition list produces a wrong result for Hive tables.
  This PR fixes these 2 problems.
  ## How was this patch tested?
  new test in hive `SQLQuerySuite`
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #13754 from cloud-fan/insert2.

* [SPARK-16023][SQL] Move InMemoryRelation to its own file (Andrew Or, 2016-06-17, 2 files, -185/+211)
  ## What changes were proposed in this pull request?
  Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
  ## How was this patch tested?
  Jenkins
  Author: Andrew Or <andrew@databricks.com>
  Closes #13742 from andrewor14/move-inmemory-relation.

* [SPARK-16020][SQL] Fix complete mode aggregation with console sink (Shixiong Zhu, 2016-06-17, 3 files, -1/+105)
  ## What changes were proposed in this pull request?
  We cannot use `limit` on a DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects the `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging.
  ## How was this patch tested?
  Manually confirmed ConsoleSink now works with complete mode aggregation.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13740 from zsxwing/complete-console.

* [SPARK-15159][SPARKR] SparkR SparkSession API (Felix Cheung, 2016-06-17, 1 file, -12/+64)
  ## What changes were proposed in this pull request?
  This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R but it's important to name this clearly.
  The SparkR implementation should follow these principles:
  - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
  - SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
  - Changes to SparkSession are mostly transparent to users due to SPARK-10903
  - Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
  - Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
  - An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next, "value"))`)
  - Updating config on an existing SparkSession is supported; the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
  - Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
  - Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
  - `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
  - All tests are updated to use the SparkSession entrypoint
  - A bug in `read.jdbc` is fixed

  TODO
  - [x] Add more tests
  - [ ] Separate PR - update all roxygen2 doc coding examples
  - [ ] Separate PR - update SparkR programming guide
  ## How was this patch tested?
  unit tests, manual tests
  shivaram sun-rui rxin
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #13635 from felixcheung/rsparksession.

* [SPARK-16033][SQL] insertInto() can't be used together with partitionBy() (Cheng Lian, 2016-06-17, 2 files, -3/+46)
  ## What changes were proposed in this pull request?
  When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have a wrong partitioning directory layout.
  ## How was this patch tested?
  New test case added in `InsertIntoHiveTableSuite`.
  Author: Cheng Lian <lian@databricks.com>
  Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.

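  A sketch of the resulting behavior, assuming `df` is a DataFrame whose columns match a pre-existing table `partitioned_logs` that is partitioned by `ds` (names are illustrative).

  ```scala
  // Rejected after this change: the extra partitionBy() would conflict with catalog metadata.
  df.write.partitionBy("ds").insertInto("partitioned_logs")

  // Correct usage: partitioning is taken from the table's catalog metadata.
  df.write.insertInto("partitioned_logs")
  ```
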
* [SPARK-15916][SQL] JDBC filter push down should respect operator precedence (hyukjinkwon, 2016-06-17, 2 files, -2/+28)
  ## What changes were proposed in this pull request?
  This PR fixes the problem that operator precedence is messed up when pushing where-clause expressions to the JDBC layer.
  **Case 1:** For sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may return fewer or more rows than expected.
  **Case 2:** For sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using a where-clause:
  ```
  spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...)
  ```
  ## How was this patch tested?
  Unit test.
  This PR also closes #13640
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13743 from clockfly/SPARK-15916.

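  An illustrative filter-to-SQL compiler (not the actual JDBCRDD code) showing why parenthesizing each side preserves precedence, so `(a OR b) AND c` never degrades to `a OR (b AND c)`.

  ```scala
  import org.apache.spark.sql.sources.{And, EqualTo, Filter, Or}

  // Handles only a few filter types; quoting of values is deliberately naive.
  def compileFilter(f: Filter): Option[String] = f match {
    case EqualTo(attr, value) => Some(s"$attr = '$value'")
    case And(left, right) =>
      for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
    case Or(left, right) =>
      for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
    case _ => None // unsupported filters are simply not pushed down
  }
  ```
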
* [SPARK-16014][SQL] Rename optimizer rules to be more consistent (Reynold Xin, 2016-06-17, 6 files, -22/+19)
  ## What changes were proposed in this pull request?
  This small patch renames a few optimizer rules to make the naming more consistent, e.g. class names start with a verb. The most important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.
  ## How was this patch tested?
  Updated test cases.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13732 from rxin/SPARK-16014.

* Remove non-obvious conf settings from TPCDS benchmark (Sameer Agarwal, 2016-06-17, 1 file, -2/+0)
  ## What changes were proposed in this pull request?
  My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code and make it non-obvious to disable whole stage codegen and/or the vectorized parquet reader.
  PS: Didn't attach a JIRA as this change should otherwise be a no-op (both of these confs are enabled by default in Spark)
  ## How was this patch tested?
  N/A
  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #13726 from sameeragarwal/tpcds-conf.

* [SPARK-15811][SQL] fix the Python UDF in Scala 2.10 (Davies Liu, 2016-06-17, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  An Iterator can't be serialized in Scala 2.10; we should force it into an array to make sure it can be serialized.
  ## How was this patch tested?
  Built with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build).
  Author: Davies Liu <davies@databricks.com>
  Closes #13717 from davies/fix_udf_210.

* [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT ↵gatorsmile2016-06-165-5/+85
| | | | | | | | | | | | | | | | OVERWRITE for DYNAMIC PARTITION #### What changes were proposed in this pull request? `IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted statement is not shown in the table. This PR is to issue an exception in this case, just like what Hive does. Also issue an exception if users specify `IF NOT EXISTS` if users do not specify any `PARTITION` specification. #### How was this patch tested? Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13447 from gatorsmile/insertIfNotExist.
* [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory (Pete Robbins, 2016-06-16, 2 files, -7/+17)
  ## What changes were proposed in this pull request?
  `UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes seg-faults and all kinds of other undefined behavior when we use one of these buffered keys. This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem. This PR is largely based on the work of robbinspg and he should be credited for this.
  closes https://github.com/apache/spark/pull/13707
  ## How was this patch tested?
  Manually tested on problematic workloads.
  Author: Pete Robbins <robbinspg@gmail.com>
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Closes #13723 from hvanhovell/SPARK-15822-2.

* [SPARK-15991] SparkContext.hadoopConfiguration should be always the base of hadoop conf created by SessionState (Yin Huai, 2016-06-16, 5 files, -17/+28)
  ## What changes were proposed in this pull request?
  Before this patch, after a SparkSession has been created, hadoop conf set directly to SparkContext.hadoopConfiguration will not affect the hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base. This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration.
  ## How was this patch tested?
  New test in SparkSessionBuilderSuite.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13711 from yhuai/SPARK-15991.

* [SPARK-15749][SQL] make the error message more meaningful (Huaxin Gao, 2016-06-16, 2 files, -3/+4)
  ## What changes were proposed in this pull request?
  For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
  ```
  sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
  ```
  I got error message
  ```
  Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.
  ```
  The error message is a little confusing. In my simple insert statement, it doesn't have a SELECT clause. I will change the error message to a more general one
  ```
  Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the data to be inserted have the same number of columns as the target table.
  ```
  ## How was this patch tested?
  I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test.
  Author: Huaxin Gao <huaxing@us.ibm.com>
  Closes #13492 from huaxingao/spark-15749.

* [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions. (Dongjoon Hyun, 2016-06-16, 4 files, -5/+5)
  ## What changes were proposed in this pull request?
  This PR contains a few changes on code comments.
  - `HiveTypeCoercion` is renamed into `TypeCoercion`.
  - `NoSuchDatabaseException` is only used for the absence of database.
  - For partition type inference, only `DoubleType` is considered.
  ## How was this patch tested?
  N/A
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13674 from dongjoon-hyun/minor_doc_types.

* [SPARK-15998][SQL] Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING (gatorsmile, 2016-06-16, 1 file, -3/+57)
  #### What changes were proposed in this pull request?
  `HIVE_METASTORE_PARTITION_PRUNING` is a public `SQLConf`. When `true`, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. The current default value is `false`. For performance improvement, users might turn this parameter on. So far, the code base does not have such a test case to verify whether this `SQLConf` properly works. This PR is to improve the test case coverage for avoiding future regression.
  #### How was this patch tested?
  N/A
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13716 from gatorsmile/addTestMetastorePartitionPruning.

* [SQL] Minor HashAggregateExec string output fixes (Cheng Lian, 2016-06-16, 1 file, -5/+5)
  ## What changes were proposed in this pull request?
  This PR fixes some minor `.toString` format issues for `HashAggregateExec`.
  Before:
  ```
  *HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L])
  ```
  After:
  ```
  *HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L])
  ```
  ## How was this patch tested?
  Manually tested.
  Author: Cheng Lian <lian@databricks.com>
  Closes #13710 from liancheng/minor-agg-string-fix.

* [SPARK-15978][SQL] improve 'show tables' command related codes (bomeng, 2016-06-16, 2 files, -2/+2)
  ## What changes were proposed in this pull request?
  I've found some minor issues in the "show tables" command:
  1. In `SessionCatalog.scala`, the `listTables(db: String)` method calls `listTables(formatDatabaseName(db), "*")` to list all the tables for a certain db, but in the method `listTables(db: String, pattern: String)`, this db name is formatted once more. So I think we should remove `formatDatabaseName()` in the caller.
  2. I suggest adding a sort to `listTables(db: String)` in `InMemoryCatalog.scala`, just like `listDatabases()`.
  ## How was this patch tested?
  The existing test cases should cover it.
  Author: bomeng <bmeng@us.ibm.com>
  Closes #13695 from bomeng/SPARK-15978.

* [SPARK-15977][SQL] Fix TRUNCATE TABLE for Spark specific datasource tables (Herman van Hovell, 2016-06-16, 2 files, -11/+21)
  ## What changes were proposed in this pull request?
  `TRUNCATE TABLE` is currently broken for Spark specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources, which allows them to be truncated.
  ## How was this patch tested?
  Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`.
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Closes #13697 from hvanhovell/SPARK-15977.

* [SPARK-15983][SQL] Removes FileFormat.prepareRead (Cheng Lian, 2016-06-16, 2 files, -15/+1)
  ## What changes were proposed in this pull request?
  Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.
  ## How was this patch tested?
  Existing tests.
  Author: Cheng Lian <lian@databricks.com>
  Closes #13698 from liancheng/remove-prepare-read.

* [SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT (gatorsmile, 2016-06-16, 7 files, -64/+121)
  #### What changes were proposed in this pull request?
  ~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREATE VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~
  ~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~
  The table name in `CACHE TABLE AS SELECT` should NOT contain a database prefix like "database.table". Thus, this PR captures this in the Parser and outputs a better error message, instead of reporting that the view already exists. In addition, this PR refactors the `Parser` to generate table identifiers instead of returning the table name string.
  #### How was this patch tested?
  - Added a test case for caching and uncaching qualified table names
  - Fixed a few test cases that do not drop the temp table at the end
  - Added the related test case for the issue resolved in this PR
  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #13572 from gatorsmile/cacheTableAsSelect.

* [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR (Narine Kokhlikyan, 2016-06-15, 6 files, -13/+190)
  ## What changes were proposed in this pull request?
  gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
  Please let me know what you think and if you have any ideas to improve it. Thank you!
  ## How was this patch tested?
  Unit tests.
  1. Primitive test with different column types
  2. Add a boolean column
  3. Compute average by a group
  Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
  Author: NarineK <narine.kokhlikyan@us.ibm.com>
  Closes #12836 from NarineK/gapply2.

* [SPARK-15824][SQL] Execute WITH .... INSERT ... statements immediately (Herman van Hovell, 2016-06-15, 3 files, -2/+27)
  ## What changes were proposed in this pull request?
  We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example:
  ```sql
  WITH tbl AS (SELECT * FROM x WHERE id = 10)
  INSERT INTO y
  SELECT * FROM tbl
  ```
  This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes an instance of this problem in the ThriftServer).
  ## How was this patch tested?
  Added a test to `InsertSuite`
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Closes #13678 from hvanhovell/SPARK-15824.

* [SPARK-13498][SQL] Increment the recordsRead input metric for JDBC data source (Wayne Song, 2016-06-15, 1 file, -0/+2)
  ## What changes were proposed in this pull request?
  This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for the JDBC data source.
  Closes #11373.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13694 from rxin/SPARK-13498.

* [SPARK-15979][SQL] Rename various Parquet support classes. (Reynold Xin, 2016-06-15, 13 files, -118/+119)
  ## What changes were proposed in this pull request?
  This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:
  1. These are not optimizer related (i.e. Catalyst) classes.
  2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.
  ## How was this patch tested?
  Renamed test cases as well.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13696 from rxin/parquet-rename.

* [SPARK-12492][SQL] Add missing SQLExecution.withNewExecutionId for hiveResultString (KaiXinXiaoLei, 2016-06-15, 1 file, -14/+17)
  ## What changes were proposed in this pull request?
  Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in the Web UI.
  Closes #13115
  ## How was this patch tested?
  Existing unit tests.
  Author: KaiXinXiaoLei <huleilei1@huawei.com>
  Closes #13689 from zsxwing/pr13115.

* [SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type (Sean Zhong, 2016-06-15, 7 files, -19/+86)
  ## What changes were proposed in this pull request?
  This PR fixes the problem that a Divide Expression inside an Aggregation function is casted to the wrong type, which causes `select 1/2` and `select sum(1/2)` to return different results.
  **Before the change:**
  ```
  scala> sql("select 1/2 as a").show()
  +---+
  |  a|
  +---+
  |0.5|
  +---+

  scala> sql("select sum(1/2) as a").show()
  +---+
  |  a|
  +---+
  |  0|
  +---+

  scala> sql("select sum(1 / 2) as a").schema
  res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true))
  ```
  **After the change:**
  ```
  scala> sql("select 1/2 as a").show()
  +---+
  |  a|
  +---+
  |0.5|
  +---+

  scala> sql("select sum(1/2) as a").show()
  +---+
  |  a|
  +---+
  |0.5|
  +---+

  scala> sql("select sum(1/2) as a").schema
  res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true))
  ```
  ## How was this patch tested?
  Unit test.
  This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13651 from clockfly/SPARK-15776.

* [SPARK-15934] [SQL] Return binary mode in ThriftServer (Egor Pakhomov, 2016-06-15, 3 files, -14/+47)
  Returning binary mode to ThriftServer for backward compatibility.
  Tested with Squirrel and Tableau.
  Author: Egor Pakhomov <egor@anchorfree.com>
  Closes #13667 from epahomov/SPARK-15095-2.0.

* [SPARK-15901][SQL][TEST] Verification of CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET (gatorsmile, 2016-06-15, 2 files, -32/+83)
  #### What changes were proposed in this pull request?
  So far, we do not have test cases for verifying whether the external parameters `HiveUtils.CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET` properly work when users use non-default values. This PR is to add such test cases for avoiding potential regression.
  #### How was this patch tested?
  N/A
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13622 from gatorsmile/addTestCase4parquetOrcConversion.

* [SPARK-15888] [SQL] fix Python UDF with aggregate (Davies Liu, 2016-06-15, 3 files, -10/+68)
  ## What changes were proposed in this pull request?
  After we moved the ExtractPythonUDF rule into the physical plan, Python UDFs can't work on top of an aggregate anymore, because they can't be evaluated before the aggregate; they should be evaluated after it. This PR adds another rule to extract these kinds of Python UDFs from the logical aggregate and create a Project on top of the Aggregate.
  ## How was this patch tested?
  Added regression tests. The plan of the added test query looks like this:
  ```
  == Parsed Logical Plan ==
  'Project [<lambda>('k, 's) AS t#26]
  +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
     +- LogicalRDD [key#5L, value#6]

  == Analyzed Logical Plan ==
  t: int
  Project [<lambda>(k#17, s#22L) AS t#26]
  +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
     +- LogicalRDD [key#5L, value#6]

  == Optimized Logical Plan ==
  Project [<lambda>(agg#29, agg#30L) AS t#26]
  +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
     +- LogicalRDD [key#5L, value#6]

  == Physical Plan ==
  *Project [pythonUDF0#37 AS t#26]
  +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
     +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
        +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
           +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
              +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
                 +- Scan ExistingRDD[key#5L,value#6]
  ```
  Author: Davies Liu <davies@databricks.com>
  Closes #13682 from davies/fix_py_udf.

* [SPARK-15959][SQL] Add the support of hive.metastore.warehouse.dir back (Yin Huai, 2016-06-15, 3 files, -27/+106)
  ## What changes were proposed in this pull request?
  This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows:
  * If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`.
  * If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`.
  * If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`.
  ## How was this patch tested?
  `set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`.
  JIRA: https://issues.apache.org/jira/browse/SPARK-15959
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13679 from yhuai/hiveWarehouseDir.

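  The first bullet above is the common case; the sketch below shows how a user would set the warehouse location (the path is an example).

  ```scala
  import org.apache.spark.sql.SparkSession

  // Setting spark.sql.warehouse.dir now also drives hive.metastore.warehouse.dir.
  val spark = SparkSession.builder()
    .appName("warehouse-example")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
  ```
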
* [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery (Tathagata Das, 2016-06-15, 26 files, -154/+152)
  Renamed for simplicity, so that it's obvious that it's related to streaming.
  Existing unit tests.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #13673 from tdas/SPARK-15953.

* [SPARK-15960][SQL] Rename `spark.sql.enableFallBackToHdfsForStats` config (Herman van Hovell, 2016-06-15, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  Since we are probably going to add more statistics related configurations in the future, I'd like to rename the newly added `spark.sql.enableFallBackToHdfsForStats` configuration option to `spark.sql.statistics.fallBackToHdfs`. This allows us to put all statistics related configurations in the same namespace.
  ## How was this patch tested?
  None - just a usability thing
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Closes #13681 from hvanhovell/SPARK-15960.

* [SPARK-15952][SQL] fix "show databases" ordering issue (bomeng, 2016-06-14, 3 files, -6/+6)
  ## What changes were proposed in this pull request?
  Two issues I've found for the "show databases" command:
  1. The returned database name list was not sorted; it only works when "like" was used together. (HIVE will always return a sorted list.)
  2. When it is used as sql("show databases").show, it will output a table with a column named "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least.
  ## How was this patch tested?
  Updated existing test case to test its ordering as well.
  Author: bomeng <bmeng@us.ibm.com>
  Closes #13671 from bomeng/SPARK-15952.

* [SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations' in hive StatisticsSuite (Herman van Hovell, 2016-06-14, 2 files, -5/+10)
  ## What changes were proposed in this pull request?
  This PR re-enables the `analyze MetastoreRelations` test in `org.apache.spark.sql.hive.StatisticsSuite`. The flakiness of this test was traced back to a shared configuration option, `hive.exec.compress.output`, in `TestHive`. This property was set to `true` by the `HiveCompatibilitySuite`. I have added configuration resetting logic to `HiveComparisonTest`, in order to prevent such a thing from happening again.
  ## How was this patch tested?
  Is a test.
  Author: Herman van Hovell <hvanhovell@databricks.com>
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #13498 from hvanhovell/SPARK-15011.

* [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs (Tathagata Das, 2016-06-14, 18 files, -590/+1109)
  ## What changes were proposed in this pull request?
  Currently, the DataFrameReader/Writer has methods that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them throws a runtime exception for one case or the other. So rather than having half the methods throw runtime exceptions, it's just better to have a different reader/writer API for streams.
  - [x] Python API!!
  ## How was this patch tested?
  Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #13653 from tdas/SPARK-15933.

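  A sketch of the split APIs, assuming a SparkSession named `spark` and example paths: batch access stays on read/write, while streams move to readStream/writeStream.

  ```scala
  // Batch read, used here only to infer a schema for the stream.
  val batchDF  = spark.read.json("/data/events")

  // Streaming read of the same directory with an explicit schema.
  val streamDF = spark.readStream.schema(batchDF.schema).json("/data/events")

  // Streaming write; the checkpoint and output locations are illustrative.
  val query = streamDF.writeStream
    .format("parquet")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start("/data/events_out")
  ```
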