Commit message (Author, Date, Files, Lines)
...
* Revert "[SPARK-18697][BUILD] Upgrade sbt plugins"Sean Owen2016-12-072-11/+14
| | | | This reverts commit 7f31d378c4025cb3dea2b96fcf2a9e451c534df0.
* [SPARK-18662] Move resource managers to separate directory (Anirudh, 2016-12-06, 81 files, -6/+6)
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? * Moves yarn and mesos scheduler backends to resource-managers/ sub-directory (in preparation for https://issues.apache.org/jira/browse/SPARK-18278) * Corresponding change in top-level pom.xml. Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340 ## How was this patch tested? * Manual tests /cc rxin Author: Anirudh <ramanathana@google.com> Closes #16092 from foxish/fix-scheduler-structure-2.
* [SPARK-18171][MESOS] Show correct framework address in mesos master web ui ↵ (Shuai Lin, 2016-12-07, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when the advertised address is used ## What changes were proposed in this pull request? In [SPARK-4563](https://issues.apache.org/jira/browse/SPARK-4563) we added the support for the driver to advertise a different hostname/ip (`spark.driver.host` to the executors other than the hostname/ip the driver actually binds to (`spark.driver.bindAddress`). But in the mesos webui's frameworks page, it still shows the driver's binds hostname/ip (though the web ui link is correct). We should fix it to make them consistent. Before: ![mesos-spark](https://cloud.githubusercontent.com/assets/717363/19835148/4dffc6d4-9eb8-11e6-8999-f80f10e4c3f7.png) After: ![mesos-spark2](https://cloud.githubusercontent.com/assets/717363/19835149/54596ae4-9eb8-11e6-896c-230426acd1a1.png) This PR doesn't affect the behavior on the spark side, only makes the display on the mesos master side more consistent. ## How was this patch tested? Manual test. - Build the package and build a docker image (spark:2.1-test) ``` sh ./dev/make-distribution.sh -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Pmesos ``` - Then run the spark driver inside a docker container. ``` sh docker run --rm -it \ --name=spark \ -p 30000-30010:30000-30010 \ -e LIBPROCESS_ADVERTISE_IP=172.17.42.1 \ -e LIBPROCESS_PORT=30000 \ -e MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-1.0.0.so \ -e MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-1.0.0.so \ -e SPARK_HOME=/opt/dist \ spark:2.1-test ``` - Inside the container, launch the spark driver, making use of the advertised address: ``` sh /opt/dist/bin/spark-shell \ --master mesos://zk://172.17.42.1:2181/mesos \ --conf spark.driver.host=172.17.42.1 \ --conf spark.driver.bindAddress=172.17.0.1 \ --conf spark.driver.port=30001 \ --conf spark.driver.blockManager.port=30002 \ --conf spark.ui.port=30003 \ --conf spark.mesos.coarse=true \ --conf spark.cores.max=2 \ --conf spark.executor.cores=1 \ --conf spark.executor.memory=1g \ --conf spark.mesos.executor.docker.image=spark:2.1-test ``` - Run several spark jobs to ensure everything is running fine. ``` scala val rdd = sc.textFile("file:///opt/dist/README.md") rdd.cache().count ``` Author: Shuai Lin <linshuai2012@gmail.com> Closes #15684 from lins05/spark-18171-show-correct-host-name-in-mesos-master-web-ui.
* [SPARK-18697][BUILD] Upgrade sbt plugins (Weiqing Yang, 2016-12-07, 2 files, -14/+11)
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded: ``` sbt-assembly: 0.11.2 -> 0.14.3 sbteclipse-plugin: 4.0.0 -> 5.0.1 sbt-mima-plugin: 0.1.11 -> 0.1.12 org.ow2.asm/asm: 5.0.3 -> 5.1 org.ow2.asm/asm-commons: 5.0.3 -> 5.1 ``` All other plugins are up-to-date. ## How was this patch tested? Pass the Jenkins build. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #16159 from weiqingy/SPARK-18697.
* [SPARK-18652][PYTHON] Include the example data and third-party licenses in ↵ (Shuai Lin, 2016-12-07, 2 files, -1/+21)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pyspark package. ## What changes were proposed in this pull request? Since we already include the python examples in the pyspark package, we should include the example data with it as well. We should also include the third-party licences since we distribute their jars with the pyspark package. ## How was this patch tested? Manually tested with python2.7 and python3.4 ```sh $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package $ cd python $ python setup.py sdist $ pip install dist/pyspark-2.1.0.dev0.tar.gz $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/ graphx mllib streaming $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/ 600K /usr/local/lib/python2.7/dist-packages/pyspark/data/ $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5 LICENSE-AnchorJS.txt LICENSE-DPark.txt LICENSE-Mockito.txt LICENSE-SnapTree.txt LICENSE-antlr.txt ``` Author: Shuai Lin <linshuai2012@gmail.com> Closes #16082 from lins05/include-data-in-pyspark-dist.
* [SPARK-18744][CORE] Remove workaround for Netty memory leak (Shixiong Zhu, 2016-12-06, 1 file, -5/+0)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We added some codes in https://github.com/apache/spark/pull/14961 because of https://github.com/netty/netty/issues/5833 Now we can remove them as it's fixed in Netty 4.0.42.Final. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16167 from zsxwing/remove-netty-workaround.
* [SPARK-18374][ML] Incorrect words in StopWords/english.txt (Yuhao, 2016-12-07, 2 files, -27/+55)
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently English stop words list in MLlib contains only the argumented words after removing all the apostrophes, so "wouldn't" become "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes. Adding original form to stop words list to match the behavior of Tokenizer and StopwordsRemover. Also remove "won" from list. see more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #16103 from hhbyyh/addstopwords.
* [SPARK-18671][SS][TEST] Added tests to ensure stability of all ↵ (Tathagata Das, 2016-12-06, 15 files, -3/+114)
| | | | | | | | | | | | | | | Structured Streaming log formats ## What changes were proposed in this pull request? To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog. ## How was this patch tested? new unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16128 from tdas/SPARK-18671.
* [SPARK-18714][SQL] Add a simple time function to SparkSession (Reynold Xin, 2016-12-06, 1 file, -0/+16)
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession: ``` scala> spark.time { spark.range(1000).count() } Time taken: 77 ms res1: Long = 1000 ``` ## How was this patch tested? I tested this interactively in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #16140 from rxin/SPARK-18714.
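A minimal sketch of how such a timing helper can be written in plain Scala, shown here only to illustrate the idea; the actual SparkSession.time implementation may differ in details:
```scala
// Sketch of a timing helper in the spirit of spark.time (not the actual
// SparkSession code): evaluate the by-name block once and report wall-clock time.
def time[T](f: => T): T = {
  val start = System.nanoTime()
  val ret = f
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"Time taken: $elapsedMs ms")
  ret
}

// Usage, mirroring the commit's example:
// spark.time { spark.range(1000).count() }
// time { (1 to 1000000).sum }
```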
* [SPARK-18740] Log spark.app.name in driver logs (Peter Ableda, 2016-12-06, 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Added simple logInfo line to print out the `spark.app.name` in the driver logs ## How was this patch tested? Spark was built and tested with SparkPi app. Example log: ``` 16/12/06 05:49:50 INFO spark.SparkContext: Running Spark version 2.0.0 16/12/06 05:49:52 INFO spark.SparkContext: Submitted application: Spark Pi 16/12/06 05:49:52 INFO spark.SecurityManager: Changing view acls to: root 16/12/06 05:49:52 INFO spark.SecurityManager: Changing modify acls to: root ``` Author: Peter Ableda <peter.ableda@cloudera.com> Closes #16172 from peterableda/feature/print_appname.
* [SPARK-18634][SQL][TRIVIAL] Touch-up Generate (Herman van Hovell, 2016-12-06, 1 file, -1/+1)
| | | | | | | | | | | | ## What changes were proposed in this pull request? I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16170 from hvanhovell/SPARK-18634.
* [SPARK-18721][SS] Fix ForeachSink with watermark + append (Shixiong Zhu, 2016-12-05, 2 files, -34/+79)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieval metrics and watermark. This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan. ## How was this patch tested? `test("foreach with watermark: append")`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16160 from zsxwing/SPARK-18721.
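For context, a minimal ForeachWriter of the kind routed through ForeachSink; the writer body below is a placeholder sketch, not code from this patch:
```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// A trivial writer: accept every partition/epoch, print each row, no cleanup needed.
val printWriter = new ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: Row): Unit = println(value)
  override def close(errorCause: Throwable): Unit = ()
}

// Example wiring on a streaming DataFrame (streamingDF is assumed to exist):
// streamingDF.writeStream.outputMode("append").foreach(printWriter).start()
```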
* [SPARK-18672][CORE] Close recordwriter in SparkHadoopMapReduceWriter before ↵ (hyukjinkwon, 2016-12-06, 1 file, -5/+17)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | committing ## What changes were proposed in this pull request? It seems some APIs such as `PairRDDFunctions.saveAsHadoopDataset()` do not close the record writer before issuing the commit for the task. On Windows, the output in the temp directory is being open and output committer tries to rename it from temp directory to the output directory after finishing writing. So, it fails to move the file. It seems we should close the writer actually before committing the task like the other writers such as `FileFormatWriter`. Identified failure was as below: ``` FAILURE! - in org.apache.spark.JavaAPISuite writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite) Time elapsed: 0.25 sec <<< ERROR! org.apache.spark.SparkException: Job aborted. at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415) at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50) at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168) ... 
8 more Driver stacktrace: at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231) Caused by: org.apache.spark.SparkException: Task failed while writing rows Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000 ``` This PR proposes to close this before committing the task. ## How was this patch tested? Manually tested via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/94-scala-tests **After** https://ci.appveyor.com/project/spark-test/spark/build/93-scala-tests Author: hyukjinkwon <gurwls223@gmail.com> Closes #16098 from HyukjinKwon/close-wirter-first.
* [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` (Michael Allman, 2016-12-06, 12 files, -30/+221)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572) ## What changes were proposed in this pull request? Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table. To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows: Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds: 7.901 3.983 4.018 4.331 4.261 Spark at bdc8153, `SHOW PARTITIONS table2` (Timed out after 10 minutes with a `SocketTimeoutException`.) Spark at this PR, `SHOW PARTITIONS table1`, times in seconds: 3.801 0.449 0.395 0.348 0.336 Spark at this PR, `SHOW PARTITIONS table2`, times in seconds: 5.184 1.63 1.474 1.519 1.41 Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master. This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x. ## How was this patch tested? I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions. Author: Michael Allman <michael@videoamp.com> Closes #15998 from mallman/spark-18572-list_partition_names.
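The command whose planning this change speeds up can be exercised directly; the table name below is a placeholder:
```scala
// SHOW PARTITIONS is now answered from partition names alone rather than from
// full partition metadata; "table1" is a placeholder table name.
spark.sql("SHOW PARTITIONS table1").show(false)
```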
* [SPARK-18722][SS] Move no data rate limit from StreamExecution to ↵ (Shixiong Zhu, 2016-12-05, 3 files, -24/+33)
| | | | | | | | | | | | | | | | ProgressReporter ## What changes were proposed in this pull request? Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16155 from zsxwing/SPARK-18722.
* [SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long ↵ (root, 2016-12-05, 2 files, -27/+80)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | integers ## What changes were proposed in this pull request? DataSet.na.fill(0) used on a DataSet which has a long value column, it will change the original long value. The reason is that the type of the function fill's param is Double, and the numeric columns are always cast to double(`fillCol[Double](f, value)`) . ``` def fill(value: Double, cols: Seq[String]): DataFrame = { val columnEquals = df.sparkSession.sessionState.analyzer.resolver val projections = df.schema.fields.map { f => // Only fill if the column is part of the cols list. if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) { fillCol[Double](f, value) } else { df.col(f.name) } } df.select(projections : _*) } ``` For example: ``` scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint] scala> df.show +-------------------+-------------------+ | a| b| +-------------------+-------------------+ | 1| 2| | -1| -2| |9123146099426677101|9123146560113991650| +-------------------+-------------------+ scala> df.na.fill(0).show +-------------------+-------------------+ | a| b| +-------------------+-------------------+ | 1| 2| | -1| -2| |9123146099426676736|9123146560113991680| +-------------------+-------------------+ ``` the original values changed [which is not we expected result]: ``` 9123146099426677101 -> 9123146099426676736 9123146560113991650 -> 9123146560113991680 ``` ## How was this patch tested? unit test added. Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)> Closes #15994 from windpiger/nafillMissupOriginalValue.
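The root cause is ordinary floating-point precision loss, reproducible in plain Scala with the commit's own values:
```scala
// A Long of this magnitude cannot be represented exactly as a Double (53-bit
// mantissa), so the Double round trip inside fillCol[Double] changes the value.
val original = 9123146099426677101L
val roundTripped = original.toDouble.toLong
println(roundTripped)            // 9123146099426676736
assert(original != roundTripped) // the value was silently altered
```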
* [SPARK-18720][SQL][MINOR] Code Refactoring of withColumn (gatorsmile, 2016-12-06, 1 file, -15/+1)
| | | | | | | | | | | | ### What changes were proposed in this pull request? Our existing withColumn for adding metadata can simply use the existing public withColumn API. ### How was this patch tested? The existing test cases cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #16152 from gatorsmile/withColumnRefactoring.
* [SPARK-18657][SPARK-18668] Make StreamingQuery.id persist across restart ↵ (Tathagata Das, 2016-12-05, 19 files, -177/+466)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | and not auto-generate StreamingQuery.name ## What changes were proposed in this pull request? Here are the major changes in this PR. - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`. - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`). - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default. - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`. Implementation details - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`). - Added the `id` as the new `StreamMetadata`. - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`. - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name` TODO - [x] Test handling of name=null in json generation of StreamingQueryProgress - [x] Test handling of name=null in json generation of StreamingQueryListener events - [x] Test python API of runId ## How was this patch tested? Updated unit tests and new unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16113 from tdas/SPARK-18657.
* [SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink (Shixiong Zhu, 2016-12-05, 1 file, -6/+13)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Move DataFrame.collect out of synchronized block so that we can query content in MemorySink when `DataFrame.collect` is running. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16162 from zsxwing/SPARK-18729.
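A self-contained sketch of the pattern applied here (not the actual MemorySink code): do the expensive work outside the lock and synchronize only the shared-state update.
```scala
import scala.collection.mutable.ArrayBuffer

// Toy sink: compute() may be slow (like DataFrame.collect), so it runs unlocked;
// only appending to and reading the shared buffer are synchronized.
class ToySink[T] {
  private val batches = new ArrayBuffer[Seq[T]]()

  def addBatch(compute: () => Seq[T]): Unit = {
    val rows = compute()             // expensive, no lock held
    synchronized { batches += rows } // cheap, lock held only briefly
  }

  def allRows: Seq[T] = synchronized { batches.flatten.toSeq }
}
```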
* [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding ↵ (Liang-Chi Hsieh, 2016-12-05, 4 files, -10/+40)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Python UDFs ## What changes were proposed in this pull request? As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL. The following test code can reproduce it. Notice: the following test code is reported to return wrong results in the Jira. However, as I tested on master branch, it causes exception and so can't return any result. >>> from pyspark.sql.functions import * >>> from pyspark.sql.types import * >>> >>> df = spark.range(10) >>> >>> def return_range(value): ... return [(i, str(i)) for i in range(value - 1, value + 1)] ... >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()), ... StructField("string_val", StringType())]))) >>> >>> df.select("id", explode(range_udf(df.id))).show() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/spark/python/pyspark/sql/dataframe.py", line 318, in show print(self._jdf.showString(n, 20)) File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120) at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57) The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`. Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes. It will cause no problem in non wholestage codegen. Because when evaluating the additional attributes are projected out the final output of `GenerateExec`. However, as `GenerateExec` now supports wholestage codegen, the framework will input all the outputs of the child plan to `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes is different to the output variables in wholestage codegen. To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct. ## How was this patch tested? Added test cases to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16120 from viirya/fix-py-udf-with-generator.
* [SPARK-18719] Add spark.ui.showConsoleProgress to configuration docs (Nicholas Chammas, 2016-12-05, 1 file, -0/+9)
| | | | | | | | | | | | This PR adds `spark.ui.showConsoleProgress` to the configuration docs. I tested this PR by building the docs locally and confirming that this change shows up as expected. Relates to #3029. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #16151 from nchammas/ui-progressbar-doc.
* [DOCS][MINOR] Update location of Spark YARN shuffle jar (Nicholas Chammas, 2016-12-05, 1 file, -1/+1)
| | | | | | | | | | Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`. This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #16130 from nchammas/yarn-doc-fix.
* [SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable (Wenchen Fan, 2016-12-05, 2 files, -5/+9)
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is kind of a long-standing bug, it's hidden until https://github.com/apache/spark/pull/15780 , which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination. However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents loop variable, which can't be evaluated ahead of the loop. This PR skips expressions containing `LambdaVariable` when doing subexpression elimination. ## How was this patch tested? updated test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #16143 from cloud-fan/aggregator.
* [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix ↵ (Shixiong Zhu, 2016-12-05, 8 files, -25/+119)
| | | | | | | | | | | | | | | | | StreamingQueryException ## What changes were proposed in this pull request? - Add StreamingQuery.explain and exception to Python. - Fix StreamingQueryException to not expose `OffsetSeq`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16125 from zsxwing/py-streaming-explain.
* [MINOR][DOC] Use SparkR `TRUE` value and add default values for ↵ (Dongjoon Hyun, 2016-12-05, 1 file, -5/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `StructField` in SQL Guide. ## What changes were proposed in this pull request? In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional. **BEFORE** * SPARK 2.1.0 RC1 http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types **AFTER** * R <img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png"> * Scala <img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png"> * Python <img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png"> ## How was this patch tested? Manual. ``` cd docs SKIP_API=1 jekyll build open _site/index.html ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
* [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide. (Yanbo Liang, 2016-12-05, 2 files, -0/+30)
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add R examples to ML programming guide for the following algorithms as POC: * spark.glm * spark.survreg * spark.naiveBayes * spark.kmeans The four algorithms were added to SparkR since 2.0.0, more docs for algorithms added during 2.1 release cycle will be addressed in a separate follow-up PR. ## How was this patch tested? This is the screenshots of generated ML programming guide for ```GeneralizedLinearRegression```: ![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #16136 from yanboliang/spark-18279.
* [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and ↵ (Zheng RuiFeng, 2016-12-05, 2 files, -1/+22)
| | | | | | | | | | | | | | setPredictionCol ## What changes were proposed in this pull request? add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel` ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16059 from zhengruifeng/ovrm_setCol.
* [SPARK-18702][SQL] input_file_block_start and input_file_block_length (Reynold Xin, 2016-12-04, 10 files, -152/+268)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We currently have function input_file_name to get the path of the input file, but don't have functions to get the block start offset and length. This patch introduces two functions: 1. input_file_block_start: returns the file block start offset, or -1 if not available. 2. input_file_block_length: returns the file block length, or -1 if not available. ## How was this patch tested? Updated existing test cases in ColumnExpressionSuite that covered input_file_name to also cover the two new functions. Author: Reynold Xin <rxin@databricks.com> Closes #16133 from rxin/SPARK-18702.
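A hedged usage sketch of the two new functions, invoked as SQL expressions; the input path is a placeholder:
```scala
import org.apache.spark.sql.functions.expr

// Each function returns -1 when the block information is not available.
val df = spark.read.parquet("/data/some_table") // placeholder path
df.select(
  expr("input_file_name()"),
  expr("input_file_block_start()"),
  expr("input_file_block_length()")
).show(false)
```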
* [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a ↵ (Felix Cheung, 2016-12-04, 3 files, -4/+9)
| | | | | | | | | | | | | | | | | | | | | package without Spark ## What changes were proposed in this pull request? If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. Fix is to always try to install or check for the cached Spark if running in an interactive session. As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc) ## How was this patch tested? Manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16077 from felixcheung/rsessioninteractive.
* [SPARK-18661][SQL] Creating a partitioned datasource table should not scan ↵ (Eric Liang, 2016-12-04, 4 files, -8/+66)
| | | | | | | | | | | | | | | | | | all files for table ## What changes were proposed in this pull request? Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema. ## How was this patch tested? Perf stat tests. Author: Eric Liang <ekl@databricks.com> Closes #16090 from ericl/spark-18661.
* [SPARK-18091][SQL] Deep if expressions cause Generated ↵ (Kapil Singh, 2016-12-04, 2 files, -13/+90)
| | | | | | | | | | | | | | | | | SpecificUnsafeProjection code to exceed JVM code size limit ## What changes were proposed in this pull request? Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit. This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code. ## How was this patch tested? Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place. Author: Kapil Singh <kapsingh@adobe.com> Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
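A hedged way to reproduce the shape of the problem is to build a deeply nested SQL if() expression programmatically; the depth, column name, and literals below are arbitrary:
```scala
// Nest if(...) 200 levels deep. Before this fix, the generated
// SpecificUnsafeProjection code for an expression like this could land in a
// single method and exceed the JVM's 64KB bytecode-per-method limit.
val deepIf = (1 to 200).foldRight("'other'") { (i, acc) =>
  s"if(x = $i, 'case$i', $acc)"
}
val df = spark.range(1000).toDF("x").selectExpr(s"$deepIf AS label")
df.count() // forces codegen and evaluation
```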
* [MINOR][README] Correct Markdown link inside readme (linbojin, 2016-12-03, 1 file, -2/+1)
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? "Useful Developer Tools" link inside [README.md](https://github.com/apache/spark/blob/master/README.md#building-spark) doesn't work on master branch. This pr corrects this Markdown link. ## How was this patch tested? [README.md](https://github.com/linbojin/spark/blob/fix-markdown-link-in-readme/README.md#building-spark) on this branch ![image](https://cloud.githubusercontent.com/assets/5894707/20864124/4c83499e-ba1e-11e6-9948-07b4627f516f.png) srowen Author: linbojin <linbojin203@gmail.com> Closes #16132 from linbojin/fix-markdown-link-in-readme.
* [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing (LSH) (Yunni, 2016-12-03, 5 files, -0/+436)
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples. ## How was this patch tested? Doc has been generated through Jekyll, and checked through manual inspection. Author: Yunni <Euler57721@gmail.com> Author: Yun Ni <yunn@uber.com> Author: Joseph K. Bradley <joseph@databricks.com> Author: Yun Ni <Euler57721@gmail.com> Closes #15795 from Yunni/SPARK-18081-lsh-guide.
* [SPARK-18582][SQL] Whitelist LogicalPlan operators allowed in correlated ↵ (Nattavut Sutyanyong, 2016-12-03, 4 files, -53/+129)
| | | | | | | | | | | | | | | | subqueries ## What changes were proposed in this pull request? This fix puts an explicit list of operators that Spark supports for correlated subqueries. ## How was this patch tested? Run sql/test, catalyst/test and add a new test case on Generate. Author: Nattavut Sutyanyong <nsy.can@gmail.com> Closes #16046 from nsyca/spark18455.0.
* [SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven plugins (Weiqing Yang, 2016-12-03, 5 files, -39/+40)
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is to upgrade: ``` sbt: 0.13.11 -> 0.13.13, zinc: 0.3.9 -> 0.3.11, maven-assembly-plugin: 2.6 -> 3.0.0 maven-compiler-plugin: 3.5.1 -> 3.6. maven-jar-plugin: 2.6 -> 3.0.2 maven-javadoc-plugin: 2.10.3 -> 2.10.4 maven-source-plugin: 2.4 -> 3.0.1 org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12 org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0 ``` The sbt release notes since the last version we used are: [v0.13.12](https://github.com/sbt/sbt/releases/tag/v0.13.12) and [v0.13.13 ](https://github.com/sbt/sbt/releases/tag/v0.13.13). ## How was this patch tested? Pass build and the existing tests. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #16069 from weiqingy/SPARK-18638.
* [SPARK-18685][TESTS] Fix URI and release resources after opening in tests at ↵ (hyukjinkwon, 2016-12-03, 1 file, -8/+17)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ExecutorClassLoaderSuite ## What changes were proposed in this pull request? This PR fixes two problems as below: - Close `BufferedSource` after `Source.fromInputStream(...)` to release resource and make the tests pass on Windows in `ExecutorClassLoaderSuite` ``` [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 milliseconds) [info] java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2 [info] at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) [info] at org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76) [info] at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213) ... ``` - Fix URI correctly so that related tests can be passed on Windows. ``` [info] - child first *** FAILED *** (78 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ... [info] - parent first *** FAILED *** (15 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ... [info] - child first can fall back *** FAILED *** (0 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ... [info] - child first can fail *** FAILED *** (0 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ... [info] - resource from parent *** FAILED *** (0 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ... [info] - resources from parent *** FAILED *** (0 milliseconds) [info] java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b [info] at java.net.URI$Parser.fail(URI.java:2848) [info] at java.net.URI$Parser.parseAuthority(URI.java:3186) ``` ## How was this patch tested? Manually tested via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite **After** https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite Author: hyukjinkwon <gurwls223@gmail.com> Closes #16116 from HyukjinKwon/close-after-open.
* [SPARK-18586][BUILD] netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 ↵ (Sean Owen, 2016-12-03, 6 files, -6/+6)
| | | | | | | | | | | | | | | | and CVE-2014-0193 ## What changes were proposed in this pull request? Force update to latest Netty 3.9.x, for dependencies like Flume, to resolve two CVEs. 3.9.2 is the first version that resolves both, and, this is the latest in the 3.9.x line. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16102 from srowen/SPARK-18586.
* [SPARK-18362][SQL] Use TextFileFormat in implementation of CSVFileFormat (Josh Rosen, 2016-12-02, 2 files, -36/+28)
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch significantly improves the IO / file listing performance of schema inference in Spark's built-in CSV data source. Previously, this data source used the legacy `SparkContext.hadoopFile` and `SparkContext.hadoopRDD` methods to read files during its schema inference step, causing huge file-listing bottlenecks on the driver. This patch refactors this logic to use Spark SQL's `text` data source to read files during this step. The text data source still performs some unnecessary file listing (since in theory we already have resolved the table prior to schema inference and therefore should be able to scan without performing _any_ extra listing), but that listing is much faster and takes place in parallel. In one production workload operating over tens of thousands of files, this change managed to reduce schema inference time from 7 minutes to 2 minutes. A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR. ## How was this patch tested? Existing unit tests, plus manual benchmarking on a production workload. Author: Josh Rosen <joshrosen@databricks.com> Closes #15813 from JoshRosen/use-text-data-source-in-csv-and-json.
* [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT (Reynold Xin, 2016-12-02, 38 files, -39/+40)
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch bumps master branch version to 2.2.0-SNAPSHOT. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #16126 from rxin/SPARK-18695.
* [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames (zero323, 2016-12-02, 2 files, -14/+51)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward compatible. ## How was this patch tested? Pyspark SQL unittests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: zero323 <zero323@users.noreply.github.com> Closes #16123 from zero323/SPARK-17845-follow-up.
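For reference, the Scala side of the same unbounded-frame API, shown only to illustrate the frame boundaries involved; the fix itself is confined to the Python wrapper, and the column names below are placeholders:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total over an unbounded-preceding-to-current-row frame.
val w = Window.partitionBy("group").orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// df.withColumn("running_total", sum("value").over(w))
```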
* [SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release (Yanbo Liang, 2016-12-02, 2 files, -134/+163)
| | | | | | | | | | | | ## What changes were proposed in this pull request? Update ML programming and migration guide for 2.1 release. ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16076 from yanboliang/spark-18324.
* [SPARK-18670][SS] Limit the number of ↵ (Shixiong Zhu, 2016-12-02, 3 files, -1/+71)
| | | | | | | | | | | | | | | | StreamingQueryListener.StreamProgressEvent when there is no data ## What changes were proposed in this pull request? This PR adds a sql conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputing the next StreamProgressEvent when there is no data. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16108 from zsxwing/SPARK-18670.
* [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm ↵ (Yanbo Liang, 2016-12-02, 2 files, -86/+12)
| | | | | | | | | | | | | | | | predict should output original label when family = binomial." ## What changes were proposed in this pull request? It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved. ## How was this patch tested? Existing unit tests. This reverts commit daa975f4bfa4f904697bf3365a4be9987032e490. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16118 from yanboliang/spark-18291-revert.
* [SPARK-18677] Fix parsing ['key'] in JSON path expressions. (Ryan Blue, 2016-12-02, 2 files, -1/+25)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This fixes the parser rule to match named expressions, which doesn't work for two reasons: 1. The name match is not coerced to a regular expression (missing .r) 2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary ## How was this patch tested? This adds test cases for named expressions using the bracket syntax, including one with quoted spaces. Author: Ryan Blue <blue@apache.org> Closes #16107 from rdblue/SPARK-18677-fix-json-path.
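A hedged example of the bracket syntax that this fix makes parseable in get_json_object; the JSON literal and names are made up for illustration:
```scala
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq("""{"key": "value", "key with spaces": 1}""").toDF("json")
df.select(
  get_json_object($"json", "$['key']"),             // bracket-quoted name
  get_json_object($"json", "$['key with spaces']")  // quoted name containing spaces
).show(false)
```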
* [SPARK-18674][SQL][FOLLOW-UP] improve the error message of using join (gatorsmile, 2016-12-02, 2 files, -7/+17)
| | | | | | | | | | | | ### What changes were proposed in this pull request? Added a test case for using joins with nested fields. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #16110 from gatorsmile/followup-18674.
* [SPARK-18659][SQL] Incorrect behaviors in overwrite table for datasource tables (Eric Liang, 2016-12-02, 14 files, -37/+110)
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Two bugs are addressed here 1. INSERT OVERWRITE TABLE sometime crashed when catalog partition management was enabled. This was because when dropping partitions after an overwrite operation, the Hive client will attempt to delete the partition files. If the entire partition directory was dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files. 2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names. cc yhuai cloud-fan ## How was this patch tested? Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases. Author: Eric Liang <ekl@databricks.com> Closes #16088 from ericl/spark-18659.
* [SPARK-18419][SQL] `JDBCRelation.insert` should not remove Spark options (Dongjoon Hyun, 2016-12-02, 4 files, -8/+28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug have been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the mixed-case options. This PR aims to fix both. **JDBCRelation.insert** ```scala override def insert(data: DataFrame, overwrite: Boolean): Unit = { val url = jdbcOptions.url val table = jdbcOptions.table - val properties = jdbcOptions.asConnectionProperties + val properties = jdbcOptions.asProperties data.write .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append) .jdbc(url, table, properties) ``` **JDBCOptions.asConnectionProperties** ```scala scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties res0: java.util.Properties = {numpartitions=10} scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties res1: java.util.Properties = {numpartitions=10} ``` ## How was this patch tested? Pass the Jenkins with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15863 from dongjoon-hyun/SPARK-18419.
* [SPARK-18679][SQL] Fix regression in file listing performance for ↵ (Eric Liang, 2016-12-02, 3 files, -34/+106)
| | | | | | | | | | | | | | | | | | | | non-catalog tables ## What changes were proposed in this pull request? In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory. This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors). cc mallman cloud-fan ## How was this patch tested? Checked metrics in unit tests. Author: Eric Liang <ekl@databricks.com> Closes #16112 from ericl/spark-18679.
* [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcase (Weiqing Yang, 2016-12-02, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix numPartition of JDBCSuite Testcase. ## How was this patch tested? Before: Run any one of the test cases in JDBCSuite, you will get the following warning. ``` 10:34:26.389 WARN org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 3; Input number of partitions: 4; Lower bound: 1; Upper bound: 4. ``` After: Pass tests without the warning. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #16062 from weiqingy/SPARK-18629.
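A hedged sketch of a partitioned JDBC read whose requested partition count already fits within the bounds, so the warning above would not be logged; the connection details are placeholders:
```scala
// numPartitions (3) does not exceed upperBound - lowerBound (4 - 1 = 3),
// so Spark does not have to reduce the partition count.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")  // placeholder URL
  .option("dbtable", "TEST.PEOPLE")     // placeholder table
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "4")
  .option("numPartitions", "3")
  .load()
```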
* [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary ↵ (Cheng Lian, 2016-12-01, 2 files, -3/+47)
| | | | | | | | | | | | | | | | | | columns due to PARQUET-686 This PR targets to both master and branch-2.1. ## What changes were proposed in this pull request? Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`. ## How was this patch tested? New test case added in `ParquetFilterSuite`. Author: Cheng Lian <lian@databricks.com> Closes #16106 from liancheng/spark-17213-bad-string-ppd.