Commit message | Author | Date | Files changed, lines removed/added
* [SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide | Liwei Lin | 2017-01-02 | 1 file, -36/+51
  ## What changes were proposed in this pull request?
  Currently some code snippets in the programming guide just do not compile. We should fix them.
  ## How was this patch tested?
  `SKIP_API=1 jekyll build`
  ## Screenshot from part of the change:
  ![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #16442 from lw-lin/ss-pro-guide-.
* [BUILD] Close stale PRs | Sean Owen | 2017-01-02 | 0 files, -0/+0
  Closes #12968 Closes #16215 Closes #16212 Closes #16086 Closes #15713 Closes #16413 Closes #16396
  Author: Sean Owen <sowen@cloudera.com>
  Closes #16447 from srowen/CloseStalePRs.
* [SPARK-19050][SS][TESTS] Fix EventTimeWatermarkSuite 'delay in months and years handled correctly' | Shixiong Zhu | 2017-01-01 | 1 file, -1/+4
  ## What changes were proposed in this pull request?
  `monthsSinceEpoch` in this test is like `math.floor(num)`, so `monthDiff` has two possible values.
  ## How was this patch tested?
  Jenkins.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #16449 from zsxwing/watermark-test-hotfix.
* [SPARK-19028][SQL] Fixed non-thread-safe functions used in SessionCatalog | gatorsmile | 2016-12-31 | 2 files, -18/+20
  ### What changes were proposed in this pull request?
  Fixed non-thread-safe functions used in SessionCatalog:
  - refreshTable
  - lookupRelation
  ### How was this patch tested?
  N/A
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #16437 from gatorsmile/addSyncToLookUpTable.
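  The patch itself is not shown in this log view; as a rough, hedged sketch of the kind of fix described (illustrative names, not the actual SessionCatalog code), the idea is to guard reads and writes of shared catalog state with the same lock:

  ```scala
  // Hypothetical simplified catalog; the real SessionCatalog synchronizes
  // refreshTable and lookupRelation over its internal state in the same spirit.
  class SimpleCatalog {
    private var tableCache = Map.empty[String, String]   // shared mutable state

    def lookupRelation(name: String): Option[String] = synchronized {
      tableCache.get(name)
    }

    def refreshTable(name: String, plan: String): Unit = synchronized {
      tableCache = tableCache.updated(name, plan)
    }
  }
  ```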
* [SPARK-19016][SQL][DOC] Document scalable partition handling | Cheng Lian | 2016-12-30 | 1 file, -4/+11
  ## What changes were proposed in this pull request?
  This PR documents the scalable partition handling feature in the body of the programming guide. Before this PR, we only mentioned it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted since 2.1.
  ## How was this patch tested?
  N/A.
  Author: Cheng Lian <lian@databricks.com>
  Closes #16424 from liancheng/scalable-partition-handling-doc.
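  A hedged usage sketch of the behavior the documentation now calls out (table name, schema, and location are made up; `spark` is the usual SparkSession):

  ```scala
  // For an external, partitioned data source table whose files already exist on
  // disk, Spark 2.1 does not discover the partitions automatically; a repair
  // command persists the per-partition metadata in the metastore.
  spark.sql(
    """CREATE TABLE logs (id BIGINT, day STRING)
      |USING parquet
      |PARTITIONED BY (day)
      |LOCATION '/data/logs'""".stripMargin)

  spark.sql("MSCK REPAIR TABLE logs")   // equivalently: ALTER TABLE logs RECOVER PARTITIONS
  ```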
* [SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing | Dongjoon Hyun | 2016-12-30 | 3 files, -25/+95
  ## What changes were proposed in this pull request?
  Apache Spark supports the following cases **by quoting RDD column names** while saving through JDBC.
  - Allow reserved keywords as column names, e.g., 'order'.
  - Allow mixed-case column names, e.g., `[a: int, A: int]`:

  ```scala
  scala> val df = sql("select 1 a, 1 A")
  df: org.apache.spark.sql.DataFrame = [a: int, A: int]
  ...
  scala> df.write.mode("overwrite").format("jdbc").options(option).save()
  scala> df.write.mode("append").format("jdbc").options(option).save()
  ```

  This PR aims to use **database column names** instead of RDD column ones in order to additionally support the following. Note that this case succeeds with `MySQL`, but fails on `Postgres`/`Oracle` before this PR.

  ```scala
  val df1 = sql("select 1 a")
  val df2 = sql("select 1 A")
  ...
  df1.write.mode("overwrite").format("jdbc").options(option).save()
  df2.write.mode("append").format("jdbc").options(option).save()
  ```

  ## How was this patch tested?
  Pass the Jenkins test with a new testcase.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #15664 from dongjoon-hyun/SPARK-18123.
* [SPARK-18922][TESTS] Fix more path-related test failures on Windows | hyukjinkwon | 2016-12-30 | 20 files, -105/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR proposes to fix the test failures due to different format of paths on Windows. Failed tests are as below: ``` ColumnExpressionSuite: - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD *** FAILED *** (187 milliseconds) "file:///C:/projects/spark/target/tmp/spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce/part-00001-c083a03a-e55e-4b05-9073-451de352d006.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce" (ColumnExpressionSuite.scala:545) - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD *** FAILED *** (172 milliseconds) "file:/C:/projects/spark/target/tmp/spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f/part-00000-f6530138-9ad3-466d-ab46-0eeb6f85ed0b.txt" did not contain "C:\projects\spark\target\tmp\spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f" (ColumnExpressionSuite.scala:569) - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD *** FAILED *** (156 milliseconds) "file:/C:/projects/spark/target/tmp/spark-a894c7df-c74d-4d19-82a2-a04744cb3766/part-00000-29674e3f-3fcf-4327-9b04-4dab1d46338d.txt" did not contain "C:\projects\spark\target\tmp\spark-a894c7df-c74d-4d19-82a2-a04744cb3766" (ColumnExpressionSuite.scala:598) ``` ``` DataStreamReaderWriterSuite: - source metadataPath *** FAILED *** (62 milliseconds) org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! 
Wanted: streamSourceProvider.createSource( org.apache.spark.sql.SQLContext3b04133b, "C:\projects\spark\target\tmp\streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0", None, "org.apache.spark.sql.streaming.test", Map() ); -> at org.apache.spark.sql.streaming.test.DataStreamReaderWriterSuite$$anonfun$12.apply$mcV$sp(DataStreamReaderWriterSuite.scala:374) Actual invocation has different arguments: streamSourceProvider.createSource( org.apache.spark.sql.SQLContext3b04133b, "/C:/projects/spark/target/tmp/streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0", None, "org.apache.spark.sql.streaming.test", Map() ); ``` ``` GlobalTempViewSuite: - CREATE GLOBAL TEMP VIEW USING *** FAILED *** (110 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-960398ba-a0a1-45f6-a59a-d98533f9f519; ``` ``` CreateTableAsSelectSuite: - CREATE TABLE USING AS SELECT *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create a table, drop it and create another one with the same name *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create table using as select - with partitioned by *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create table using as select - with non-zero buckets *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveMetadataCacheSuite: - partitioned table is cached when partition pruning is true *** FAILED *** (532 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - partitioned table is cached when partition pruning is false *** FAILED *** (297 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` MultiDatabaseSuite: - createExternalTable() to non-default database - with USE *** FAILED *** (954 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-0839d9a7-5e29-467a-9e3e-3e4cd618ee09; - createExternalTable() to non-default database - without USE *** FAILED *** (500 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c7e24d73-1d8f-45e8-ab7d-53a83087aec3; - invalid database name and table names *** FAILED *** (31 milliseconds) "Path does not exist: file:/C:projectsspark arget mpspark-15a2a494-3483-4876-80e5-ec396e704b77;" did not contain "`t:a` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _." 
(MultiDatabaseSuite.scala:296) ``` ``` OrcQuerySuite: - SPARK-8501: Avoids discovery schema from empty ORC files *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - converted ORC table supports resolving mixed case field *** FAILED *** (297 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite: - Locality support for FileScanRDD *** FAILED *** (15 milliseconds) java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-383d1f13-8783-47fd-964d-9c75e5eec50f, expected: file:/// ``` ``` HiveQuerySuite: - CREATE TEMPORARY FUNCTION *** FAILED *** (0 milliseconds) java.net.MalformedURLException: For input string: "%5Cprojects%5Cspark%5Csql%5Chive%5Ctarget%5Cscala-2.11%5Ctest-classes%5CTestUDTF.jar" - ADD FILE command *** FAILED *** (500 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\sql\hive\target\scala-2.11\test-classes\data\files\v1.txt - ADD JAR command 2 *** FAILED *** (110 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilessample.json; ``` ``` PruneFileSourcePartitionsSuite: - PruneFileSourcePartitions should not change the output of LogicalRelation *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` HiveCommandSuite: - LOAD DATA LOCAL *** FAILED *** (109 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat; - LOAD DATA *** FAILED *** (93 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpemployee.dat7496657117354281006.tmp - Truncate Table *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat; ``` ``` HiveExternalCatalogBackwardCompatibilitySuite: - make sure we can read table created by old version of Spark *** FAILED *** (0 milliseconds) "[/C:/projects/spark/target/tmp/]spark-0554d859-74e1-..." did not equal "[C:\projects\spark\target\tmp\]spark-0554d859-74e1-..." 
(HiveExternalCatalogBackwardCompatibilitySuite.scala:213) org.scalatest.exceptions.TestFailedException - make sure we can alter table location created by old version of Spark *** FAILED *** (110 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpspark-0e9b2c5f-49a1-4e38-a32a-c0ab1813a79f ``` ``` ExternalCatalogSuite: - create/drop/rename partitions should create/delete/rename the directory *** FAILED *** (610 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-4c24f010-18df-437b-9fed-990c6f9adece ``` ``` SQLQuerySuite: - describe functions - temporary user defined functions *** FAILED *** (16 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 22: C:projectssparksqlhive argetscala-2.11 est-classesTestUDTF.jar - specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-a34c9814-a483-43f2-be29-37f616b6df91; ``` ``` PartitionProviderCompatibilitySuite: - convert partition provider to hive with repair table *** FAILED *** (281 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-ee5fc96d-8c7d-4ebf-8571-a1d62736473e; - when partition management is enabled, new tables have partition provider hive *** FAILED *** (187 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-803ad4d6-3e8c-498d-9ca5-5cda5d9b2a48; - when partition management is disabled, new tables have no partition provider *** FAILED *** (172 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c9fda9e2-4020-465f-8678-52cd72d0a58f; - when partition management is disabled, we preserve the old behavior even for new tables *** FAILED *** (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e13; - insert overwrite partition of legacy datasource table *** FAILED *** (188 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e79; - insert overwrite partition of new datasource table overwrites just partition *** FAILED *** (219 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-6ba3a88d-6f6c-42c5-a9f4-6d924a0616ff; - SPARK-18544 append with saveAsTable - partition management true *** FAILED *** (173 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-cd234a6d-9cb4-4d1d-9e51-854ae9543bbd; - SPARK-18635 special chars in partition values - partition management true *** FAILED *** (2 seconds, 967 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18635 special chars in partition values - partition management false *** FAILED *** (62 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table with lowercase - partition management true *** 
FAILED *** (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18544 append with saveAsTable - partition management false *** FAILED *** (266 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table files - partition management false *** FAILED *** (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table with lowercase - partition management false *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - sanity check table setup *** FAILED *** (31 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into partial dynamic partitions *** FAILED *** (47 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into fully dynamic partitions *** FAILED *** (62 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into static partition *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite partial dynamic partitions *** FAILED *** (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite fully dynamic partitions *** FAILED *** (47 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite static partition *** FAILED *** (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` MetastoreDataSourcesSuite: - check change without refresh *** FAILED *** (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-00713fe4-ca04-448c-bfc7-6c5e9a2ad2a1; - drop, change, recreate *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-2030a21b-7d67-4385-a65b-bb5e2bed4861; - SPARK-15269 external data source table creation *** FAILED *** (78 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget 
mpspark-4d50fd4a-14bc-41d6-9232-9554dd233f86; - CTAS *** FAILED *** (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS with IF NOT EXISTS *** FAILED *** (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS: persisted partitioned bucketed data source table *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - SPARK-15025: create datasource table with path with select *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS: persisted partitioned data source table *** FAILED *** (47 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveMetastoreCatalogSuite: - Persist non-partitioned parquet relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - Persist non-partitioned orc relation into metastore as managed table using CTAS *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveUDFSuite: - SPARK-11522 select input_file_name from non-parquet table *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` QueryPartitionSuite: - SPARK-13709: reading partitioned Avro table with nested schema *** FAILED *** (250 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` ParquetHiveCompatibilitySuite: - simple primitives *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-10177 timestamp *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - array *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - map *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - struct *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-16344: array of struct with a single field named 'array_element' *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ## How was this patch tested? Manually tested via AppVeyor. 
``` ColumnExpressionSuite: - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD (234 milliseconds) - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD (235 milliseconds) - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD (203 milliseconds) ``` ``` DataStreamReaderWriterSuite: - source metadataPath (63 milliseconds) ``` ``` GlobalTempViewSuite: - CREATE GLOBAL TEMP VIEW USING (436 milliseconds) ``` ``` CreateTableAsSelectSuite: - CREATE TABLE USING AS SELECT (171 milliseconds) - create a table, drop it and create another one with the same name (422 milliseconds) - create table using as select - with partitioned by (141 milliseconds) - create table using as select - with non-zero buckets (125 milliseconds) ``` ``` HiveMetadataCacheSuite: - partitioned table is cached when partition pruning is true (3 seconds, 211 milliseconds) - partitioned table is cached when partition pruning is false (1 second, 781 milliseconds) ``` ``` MultiDatabaseSuite: - createExternalTable() to non-default database - with USE (797 milliseconds) - createExternalTable() to non-default database - without USE (640 milliseconds) - invalid database name and table names (62 milliseconds) ``` ``` OrcQuerySuite: - SPARK-8501: Avoids discovery schema from empty ORC files (703 milliseconds) - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC (750 milliseconds) - converted ORC table supports resolving mixed case field (625 milliseconds) ``` ``` HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite: - Locality support for FileScanRDD (296 milliseconds) ``` ``` HiveQuerySuite: - CREATE TEMPORARY FUNCTION (125 milliseconds) - ADD FILE command (250 milliseconds) - ADD JAR command 2 (609 milliseconds) ``` ``` PruneFileSourcePartitionsSuite: - PruneFileSourcePartitions should not change the output of LogicalRelation (359 milliseconds) ``` ``` HiveCommandSuite: - LOAD DATA LOCAL (1 second, 829 milliseconds) - LOAD DATA (1 second, 735 milliseconds) - Truncate Table (1 second, 641 milliseconds) ``` ``` HiveExternalCatalogBackwardCompatibilitySuite: - make sure we can read table created by old version of Spark (32 milliseconds) - make sure we can alter table location created by old version of Spark (125 milliseconds) - make sure we can rename table created by old version of Spark (281 milliseconds) ``` ``` ExternalCatalogSuite: - create/drop/rename partitions should create/delete/rename the directory (625 milliseconds) ``` ``` SQLQuerySuite: - describe functions - temporary user defined functions (31 milliseconds) - specifying database name for a temporary table is not allowed (390 milliseconds) ``` ``` PartitionProviderCompatibilitySuite: - convert partition provider to hive with repair table (813 milliseconds) - when partition management is enabled, new tables have partition provider hive (562 milliseconds) - when partition management is disabled, new tables have no partition provider (344 milliseconds) - when partition management is disabled, we preserve the old behavior even for new tables (422 milliseconds) - insert overwrite partition of legacy datasource table (750 milliseconds) - SPARK-18544 append with saveAsTable - partition management true (985 milliseconds) - SPARK-18635 special chars in partition values - partition management true (3 seconds, 328 milliseconds) - SPARK-18635 special chars in partition values - partition management false (2 seconds, 
891 milliseconds) - SPARK-18659 insert overwrite table with lowercase - partition management true (750 milliseconds) - SPARK-18544 append with saveAsTable - partition management false (656 milliseconds) - SPARK-18659 insert overwrite table files - partition management false (922 milliseconds) - SPARK-18659 insert overwrite table with lowercase - partition management false (469 milliseconds) - sanity check table setup (937 milliseconds) - insert into partial dynamic partitions (2 seconds, 985 milliseconds) - insert into fully dynamic partitions (1 second, 937 milliseconds) - insert into static partition (1 second, 578 milliseconds) - overwrite partial dynamic partitions (7 seconds, 561 milliseconds) - overwrite fully dynamic partitions (1 second, 766 milliseconds) - overwrite static partition (1 second, 797 milliseconds) ``` ``` MetastoreDataSourcesSuite: - check change without refresh (610 milliseconds) - drop, change, recreate (437 milliseconds) - SPARK-15269 external data source table creation (297 milliseconds) - CTAS with IF NOT EXISTS (437 milliseconds) - CTAS: persisted partitioned bucketed data source table (422 milliseconds) - SPARK-15025: create datasource table with path with select (265 milliseconds) - CTAS (438 milliseconds) - CTAS with IF NOT EXISTS (469 milliseconds) - CTAS: persisted partitioned bucketed data source table (406 milliseconds) ``` ``` HiveMetastoreCatalogSuite: - Persist non-partitioned parquet relation into metastore as managed table using CTAS (406 milliseconds) - Persist non-partitioned orc relation into metastore as managed table using CTAS (313 milliseconds) ``` ``` HiveUDFSuite: - SPARK-11522 select input_file_name from non-parquet table (3 seconds, 144 milliseconds) ``` ``` QueryPartitionSuite: - SPARK-13709: reading partitioned Avro table with nested schema (1 second, 67 milliseconds) ``` ``` ParquetHiveCompatibilitySuite: - simple primitives (745 milliseconds) - SPARK-10177 timestamp (375 milliseconds) - array (407 milliseconds) - map (409 milliseconds) - struct (437 milliseconds) - SPARK-16344: array of struct with a single field named 'array_element' (391 milliseconds) ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16397 from HyukjinKwon/SPARK-18922-paths.
* [SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient | Sean Owen | 2016-12-30 | 2 files, -10/+9
  ## What changes were proposed in this pull request?
  mllib.KMeansModel.clusterCentersWithNorm is a method that ends up being called every time `predict` is called on a single vector, which is bad news for how the ml.KMeansModel Transformer works, since it necessarily transforms one vector at a time. This change makes the model store the vectors with norms upfront. The extra norm should be small compared to the vectors. This avoids this form of overhead on this and other code paths.
  ## How was this patch tested?
  Existing tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #16328 from srowen/SPARK-18808.
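  A hedged sketch of the idea (illustrative names, not the real MLlib internals): pay for each center's norm once when the model is built rather than on every single-row predict call. MLlib uses the cached norms for fast distance bounds; this sketch only shows where the one-time cost moves to.

  ```scala
  import org.apache.spark.ml.linalg.{Vector, Vectors}

  class CachedCenters(centers: Array[Vector]) {
    // Norms are computed once per model, not once per predicted row.
    private val centersWithNorm: Array[(Vector, Double)] =
      centers.map(c => (c, Vectors.norm(c, 2.0)))

    def predict(point: Vector): Int =
      centersWithNorm.indices.minBy(i => Vectors.sqdist(centersWithNorm(i)._1, point))
  }
  ```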
* Update known_translations for contributor names and also fix a small issue in translate-contributors.py | Yin Huai | 2016-12-29 | 2 files, -1/+40
  ## What changes were proposed in this pull request?
  This PR updates dev/create-release/known_translations to add more contributor name mappings. It also fixes a small issue in translate-contributors.py.
  ## How was this patch tested?
  Manually tested.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #16423 from yhuai/contributors.
* [SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD | adesharatushar | 2016-12-29 | 1 file, -0/+72
  ## What changes were proposed in this pull request?
  Added the missing Java example under the section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving the consistency of the documentation.
  ## How was this patch tested?
  Manual. Generated docs using the command "SKIP_API=1 jekyll build" and verified the generated HTML page manually. The syntax of the example has been tested for correctness using sample code on Java 1.7 and Spark 2.2.0-SNAPSHOT.
  Author: adesharatushar <tushar_adeshara@persistent.com>
  Closes #16408 from adesharatushar/streaming-doc-fix.
* [SPARK-18698][ML] Adding public constructor that takes uid for IndexToString | Ilya Matiach | 2016-12-29 | 2 files, -2/+8
  ## What changes were proposed in this pull request?
  Based on SPARK-18698, this adds a public constructor that takes a UID for IndexToString. Other transforms have similar constructors.
  ## How was this patch tested?
  A unit test was added to verify the new functionality.
  Author: Ilya Matiach <ilmat@microsoft.com>
  Closes #16436 from imatiach-msft/ilmat/fix-indextostring.
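  A hedged usage sketch (the uid string, column names, and labels are arbitrary): pinning the uid explicitly is useful when transformers are constructed programmatically and need stable identifiers.

  ```scala
  import org.apache.spark.ml.feature.IndexToString

  val labelConverter = new IndexToString("idx2str_predictions")  // uid passed to the new constructor
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(Array("ham", "spam"))
  ```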
* [SPARK-19012][SQL] Fix `createTempViewCommand` to throw AnalysisException instead of ParseException | Dongjoon Hyun | 2016-12-29 | 2 files, -11/+20
  ## What changes were proposed in this pull request?
  Currently, `createTempView`, `createOrReplaceTempView`, and `createGlobalTempView` show `ParseException`s on invalid table names. We should show a better error message. This PR also adds and updates the missing descriptions in the API docs.

  **BEFORE**
  ```
  scala> spark.range(10).createOrReplaceTempView("11111")
  org.apache.spark.sql.catalyst.parser.ParseException:
  mismatched input '11111' expecting {'SELECT', 'FROM', 'ADD', ...}(line 1, pos 0)

  == SQL ==
  11111
  ...
  ```

  **AFTER**
  ```
  scala> spark.range(10).createOrReplaceTempView("11111")
  org.apache.spark.sql.AnalysisException: Invalid view name: 11111;
  ...
  ```

  ## How was this patch tested?
  Pass Jenkins with an updated test case.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #16427 from dongjoon-hyun/SPARK-19012.
* [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand | Wenchen Fan | 2016-12-28 | 4 files, -198/+213
  ## What changes were proposed in this pull request?
  The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists:
  1. throw an exception if we don't want to ignore it.
  2. do some checks and adjust the schema if we want to append data.
  3. drop the table and create it again if we want to overwrite.
  Items 2 and 3 should be done by the analyzer, so that we can also apply them to Hive tables.
  ## How was this patch tested?
  Existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #15996 from cloud-fan/append.
* [SPARK-16213][SQL] Reduce runtime overhead of a program that creates a primitive array in DataFrame | Kazuaki Ishizaki | 2016-12-29 | 9 files, -91/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | primitive array in DataFrame ## What changes were proposed in this pull request? This PR reduces runtime overhead of a program the creates an primitive array in DataFrame by using the similar approach to #15044. Generated code performs boxing operation in an assignment from InternalRow to an `Object[]` temporary array (at Lines 051 and 061 in the generated code before without this PR). If we know that type of array elements is primitive, we apply the following optimizations: 1. Eliminate a pair of `isNullAt()` and a null assignment 2. Allocate an primitive array instead of `Object[]` (eliminate boxing operations) 3. Create `UnsafeArrayData` by using `UnsafeArrayWriter` to keep a primitive array in a row format instead of doing non-lightweight operations in constructor of `GenericArrayData` The PR also performs the same things for `CreateMap`. Here are performance results of [DataFrame programs](https://github.com/kiszk/spark/blob/6bf54ec5e227689d69f6db991e9ecbc54e153d0a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/PrimitiveArrayBenchmark.scala#L83-L112) by up to 17.9x over without this PR. ``` Without SPARK-16043 OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64 Intel Xeon E3-12xx v2 (Ivy Bridge) Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 3805 / 4150 0.0 507308.9 1.0X Double 3593 / 3852 0.0 479056.9 1.1X With SPARK-16043 Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 213 / 271 0.0 28387.5 1.0X Double 204 / 223 0.0 27250.9 1.0X ``` Note : #15780 is enabled for these measurements An motivating example ``` java val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF df.selectExpr("Array(value + 1.1d, value + 2.2d)").show ``` Generated code without this PR ``` java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ private Object[] project_values; /* 013 */ private UnsafeRow project_result; /* 014 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; /* 015 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; /* 017 */ /* 018 */ public GeneratedIterator(Object[] references) { /* 019 */ this.references = 
references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ inputadapter_input = inputs[0]; /* 026 */ serializefromobject_result = new UnsafeRow(1); /* 027 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 028 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 029 */ this.project_values = null; /* 030 */ project_result = new UnsafeRow(1); /* 031 */ this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); /* 032 */ this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); /* 033 */ this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); /* 034 */ /* 035 */ } /* 036 */ /* 037 */ protected void processNext() throws java.io.IOException { /* 038 */ while (inputadapter_input.hasNext()) { /* 039 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 040 */ double inputadapter_value = inputadapter_row.getDouble(0); /* 041 */ /* 042 */ final boolean project_isNull = false; /* 043 */ this.project_values = new Object[2]; /* 044 */ boolean project_isNull1 = false; /* 045 */ /* 046 */ double project_value1 = -1.0; /* 047 */ project_value1 = inputadapter_value + 1.1D; /* 048 */ if (false) { /* 049 */ project_values[0] = null; /* 050 */ } else { /* 051 */ project_values[0] = project_value1; /* 052 */ } /* 053 */ /* 054 */ boolean project_isNull4 = false; /* 055 */ /* 056 */ double project_value4 = -1.0; /* 057 */ project_value4 = inputadapter_value + 2.2D; /* 058 */ if (false) { /* 059 */ project_values[1] = null; /* 060 */ } else { /* 061 */ project_values[1] = project_value4; /* 062 */ } /* 063 */ /* 064 */ final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values); /* 065 */ this.project_values = null; /* 066 */ project_holder.reset(); /* 067 */ /* 068 */ project_rowWriter.zeroOutNullBytes(); /* 069 */ /* 070 */ if (project_isNull) { /* 071 */ project_rowWriter.setNullAt(0); /* 072 */ } else { /* 073 */ // Remember the current cursor so that we can calculate how many bytes are /* 074 */ // written later. /* 075 */ final int project_tmpCursor = project_holder.cursor; /* 076 */ /* 077 */ if (project_value instanceof UnsafeArrayData) { /* 078 */ final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes(); /* 079 */ // grow the global buffer before writing data. 
/* 080 */ project_holder.grow(project_sizeInBytes); /* 081 */ ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor); /* 082 */ project_holder.cursor += project_sizeInBytes; /* 083 */ /* 084 */ } else { /* 085 */ final int project_numElements = project_value.numElements(); /* 086 */ project_arrayWriter.initialize(project_holder, project_numElements, 8); /* 087 */ /* 088 */ for (int project_index = 0; project_index < project_numElements; project_index++) { /* 089 */ if (project_value.isNullAt(project_index)) { /* 090 */ project_arrayWriter.setNullDouble(project_index); /* 091 */ } else { /* 092 */ final double project_element = project_value.getDouble(project_index); /* 093 */ project_arrayWriter.write(project_index, project_element); /* 094 */ } /* 095 */ } /* 096 */ } /* 097 */ /* 098 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); /* 099 */ } /* 100 */ project_result.setTotalSize(project_holder.totalSize()); /* 101 */ append(project_result); /* 102 */ if (shouldStop()) return; /* 103 */ } /* 104 */ } /* 105 */ } ``` Generated code with this PR ``` java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ private UnsafeArrayData project_arrayData; /* 013 */ private UnsafeRow project_result; /* 014 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; /* 015 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; /* 017 */ /* 018 */ public GeneratedIterator(Object[] references) { /* 019 */ this.references = references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ inputadapter_input = inputs[0]; /* 026 */ serializefromobject_result = new UnsafeRow(1); /* 027 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 028 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 029 */ /* 030 */ project_result = new UnsafeRow(1); /* 031 */ this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); /* 032 */ this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); /* 033 */ this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); /* 034 */ /* 035 */ } /* 036 */ /* 037 */ protected void processNext() throws java.io.IOException { /* 038 */ while (inputadapter_input.hasNext()) { /* 039 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 040 */ double inputadapter_value = inputadapter_row.getDouble(0); /* 041 */ /* 042 */ byte[] project_array = new byte[32]; /* 043 */ 
project_arrayData = new UnsafeArrayData(); /* 044 */ Platform.putLong(project_array, 16, 2); /* 045 */ project_arrayData.pointTo(project_array, 16, 32); /* 046 */ /* 047 */ boolean project_isNull1 = false; /* 048 */ /* 049 */ double project_value1 = -1.0; /* 050 */ project_value1 = inputadapter_value + 1.1D; /* 051 */ if (false) { /* 052 */ project_arrayData.setNullAt(0); /* 053 */ } else { /* 054 */ project_arrayData.setDouble(0, project_value1); /* 055 */ } /* 056 */ /* 057 */ boolean project_isNull4 = false; /* 058 */ /* 059 */ double project_value4 = -1.0; /* 060 */ project_value4 = inputadapter_value + 2.2D; /* 061 */ if (false) { /* 062 */ project_arrayData.setNullAt(1); /* 063 */ } else { /* 064 */ project_arrayData.setDouble(1, project_value4); /* 065 */ } /* 066 */ project_holder.reset(); /* 067 */ /* 068 */ // Remember the current cursor so that we can calculate how many bytes are /* 069 */ // written later. /* 070 */ final int project_tmpCursor = project_holder.cursor; /* 071 */ /* 072 */ if (project_arrayData instanceof UnsafeArrayData) { /* 073 */ final int project_sizeInBytes = ((UnsafeArrayData) project_arrayData).getSizeInBytes(); /* 074 */ // grow the global buffer before writing data. /* 075 */ project_holder.grow(project_sizeInBytes); /* 076 */ ((UnsafeArrayData) project_arrayData).writeToMemory(project_holder.buffer, project_holder.cursor); /* 077 */ project_holder.cursor += project_sizeInBytes; /* 078 */ /* 079 */ } else { /* 080 */ final int project_numElements = project_arrayData.numElements(); /* 081 */ project_arrayWriter.initialize(project_holder, project_numElements, 8); /* 082 */ /* 083 */ for (int project_index = 0; project_index < project_numElements; project_index++) { /* 084 */ if (project_arrayData.isNullAt(project_index)) { /* 085 */ project_arrayWriter.setNullDouble(project_index); /* 086 */ } else { /* 087 */ final double project_element = project_arrayData.getDouble(project_index); /* 088 */ project_arrayWriter.write(project_index, project_element); /* 089 */ } /* 090 */ } /* 091 */ } /* 092 */ /* 093 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); /* 094 */ project_result.setTotalSize(project_holder.totalSize()); /* 095 */ append(project_result); /* 096 */ if (shouldStop()) return; /* 097 */ } /* 098 */ } /* 099 */ } ``` ## How was this patch tested? Added unit tests into `DataFrameComplexTypeSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #13909 from kiszk/SPARK-16213.
* [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status | Tathagata Das | 2016-12-28 | 3 files, -107/+353
  ## What changes were proposed in this pull request?
  - Extended the Window operation section with a code snippet and an explanation of watermarking
  - Extended the Output Mode section with a table showing the compatibility between query type and output mode
  - Rewrote the Monitoring section with updated JSONs generated by StreamingQuery.progress/status
  - Updated API changes in the StreamingQueryListener example
  TODO
  - [x] Figure showing the watermarking
  ## How was this patch tested?
  N/A
  ## Screenshots
  ### Section: Windowed Aggregation with Event Time
  https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png
  https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png
  https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png
  ### Section: Output Modes
  https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png
  ### Section: Monitoring
  https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png
  https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #16294 from tdas/SPARK-18669.
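  The windowed-aggregation-with-watermark pattern the extended section documents looks roughly like the hedged sketch below (column names and thresholds are made up; `events` stands for any streaming DataFrame with an event-time column):

  ```scala
  import org.apache.spark.sql.functions.{col, window}

  // Records older than the 10-minute watermark can be dropped and the
  // corresponding window state can be cleaned up by the engine.
  val windowedCounts = events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
    .count()
  ```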
* [SPARK-17772][ML][TEST] Add test functions for ML sample weights | sethah | 2016-12-28 | 4 files, -218/+154
  ## What changes were proposed in this pull request?
  More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, there seem to be a few tests that have been implemented commonly:
  * Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
  * Check that outliers with tiny sample weights do not affect the algorithm's performance
  This patch adds an additional test:
  * Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with `w_i = 1.0` are effectively the same as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`
  The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.
  ## How was this patch tested?
  This patch only involves modifying test suites.
  ## Other notes
  Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #15721 from sethah/SPARK-17772.
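  A hedged sketch of what the "invariant to constant scaling of sample weights" check could look like (the real `MLTestingUtils` helper differs in shape and naming):

  ```scala
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.lit

  def checkConstantWeightScaling[M](
      fit: DataFrame => M,               // fits a model using the "weight" column
      assertModelsEqual: (M, M) => Unit,
      data: DataFrame): Unit = {
    val uniform = fit(data.withColumn("weight", lit(1.0)))
    val scaled  = fit(data.withColumn("weight", lit(10000.0)))
    // Scaling every weight by the same constant should not change the fitted model.
    assertModelsEqual(uniform, scaled)
  }
  ```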
* [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags | Sean Owen | 2016-12-28 | 1 file, -0/+8
  ## What changes were proposed in this pull request?
  This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them.
  ## How was this patch tested?
  Existing tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #16418 from srowen/SPARK-18993.
* [MINOR][DOC] Fix doc of ForeachWriter to use writeStream | Carson Wang | 2016-12-28 | 1 file, -2/+2
  ## What changes were proposed in this pull request?
  Fix the document of `ForeachWriter` to use `writeStream` instead of `write` for a streaming dataset.
  ## How was this patch tested?
  Docs only.
  Author: Carson Wang <carson.wang@intel.com>
  Closes #16419 from carsonwang/FixDoc.
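  The corrected call shape, as a hedged sketch (`streamingDataset` is any streaming Dataset and `MyForeachWriter` stands in for a concrete `org.apache.spark.sql.ForeachWriter` implementation):

  ```scala
  // A streaming Dataset is written with writeStream, not write.
  val query = streamingDataset.writeStream
    .foreach(new MyForeachWriter)
    .outputMode("append")
    .start()
  ```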
* [SPARK-18960][SQL][SS] Avoid double reading file which is being copied. | uncleGen | 2016-12-28 | 2 files, -3/+9
  ## What changes were proposed in this pull request?
  In HDFS, when a file is copied into the target directory, there is a temporary `._COPY_` file for a period of time; the duration depends on the file size. If we do not skip this file, we may read the same data twice.
  ## How was this patch tested?
  Updated unit test.
  Author: uncleGen <hustyugm@gmail.com>
  Closes #16370 from uncleGen/SPARK-18960.
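  A hedged sketch of the filtering idea, not the actual FileStreamSource change (the suffix check follows the commit message's naming):

  ```scala
  import org.apache.hadoop.fs.FileStatus

  // Skip copy-in-progress temporaries when listing candidate files so the same
  // data is not ingested twice.
  def isCopyInProgress(status: FileStatus): Boolean =
    status.getPath.getName.contains("._COPY")

  def listReadyFiles(all: Seq[FileStatus]): Seq[FileStatus] =
    all.filterNot(isCopyInProgress)
  ```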
* [SPARK-19010][CORE] Include Kryo exception in case of overflow | Sergei Lebedev | 2016-12-28 | 2 files, -2/+3
  ## What changes were proposed in this pull request?
  This is to work around an implicit result of #4947, which suppressed the original Kryo exception if the overflow happened during serialization.
  ## How was this patch tested?
  `KryoSerializerSuite` was augmented to reflect this change.
  Author: Sergei Lebedev <superbobry@gmail.com>
  Closes #16416 from superbobry/patch-1.
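  A hedged sketch of the behavior described (message text and helper name are illustrative): on a buffer overflow, rethrow with the original `KryoException` attached as the cause rather than discarding it.

  ```scala
  import com.esotericsoftware.kryo.KryoException
  import org.apache.spark.SparkException

  def withOverflowContext[T](serialize: => T): T =
    try serialize catch {
      case e: KryoException if e.getMessage != null && e.getMessage.startsWith("Buffer overflow") =>
        // Passing `e` as the second argument keeps the Kryo error in the stack trace.
        throw new SparkException("Kryo serialization failed: " + e.getMessage +
          ". To avoid this, increase spark.kryoserializer.buffer.max value.", e)
    }
  ```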
* [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction. | Yanbo Liang | 2016-12-28 | 1 file, -2/+18
  ## What changes were proposed in this pull request?
  Correct test cases of `LogisticRegression` raw2prediction & probability2prediction.
  ## How was this patch tested?
  Changed unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #16407 from yanboliang/raw-probability.
* [SPARK-17645][MLLIB][ML] Add feature selector methods based on False Discovery Rate (FDR) and Family-wise error rate (FWE) | Peng | 2016-12-28 | 9 files, -64/+337
  ## What changes were proposed in this pull request?
  Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR implementation in this PR uses the Benjamini-Hochberg procedure (https://en.wikipedia.org/wiki/False_discovery_rate). In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests (https://en.wikipedia.org/wiki/Family-wise_error_rate). We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn (http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection).
  ## How was this patch tested?
  Unit tests will be added soon.
  Author: Peng <peng.meng@intel.com>
  Author: Peng, Meng <peng.meng@intel.com>
  Closes #15212 from mpjlu/fdr_fwe.
* [DOC][BUILD][MINOR] add doc on new make-distribution switches | Felix Cheung | 2016-12-27 | 1 file, -6/+8
  ## What changes were proposed in this pull request?
  Add an example with the `--pip` and `--r` switches, as is actually done in create-release.
  ## How was this patch tested?
  Doc only.
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16364 from felixcheung/buildguide.
* [SPARK-18992][SQL] Move spark.sql.hive.thriftServer.singleSession to SQLConf | gatorsmile | 2016-12-28 | 5 files, -51/+87
  ### What changes were proposed in this pull request?
  Since `spark.sql.hive.thriftServer.singleSession` is a configuration of the SQL component, this conf can be moved from `SparkConf` to `StaticSQLConf`. When we introduced `spark.sql.hive.thriftServer.singleSession`, all the SQL configurations were session specific; they could be modified in different sessions. In Spark 2.1, static SQL configuration was added, and it is a perfect fit for `spark.sql.hive.thriftServer.singleSession`. Previously, we did the same move for `spark.sql.warehouse.dir` from `SparkConf` to `StaticSQLConf`.
  ### How was this patch tested?
  Added test cases in HiveThriftServer2Suites.scala
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #16392 from gatorsmile/hiveThriftServerSingleSession.
* [SPARK-19006][DOCS] mention spark.kryoserializer.buffer.max must be less than 2048m in doc | Yuexin Zhang | 2016-12-27 | 1 file, -2/+2
  ## What changes were proposed in this pull request?
  On the configuration doc page (https://spark.apache.org/docs/latest/configuration.html) we describe spark.kryoserializer.buffer.max as: "Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a 'buffer limit exceeded' exception inside Kryo." The source code has a hard-coded upper limit:

  ```scala
  val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
  if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
    throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
      s"2048 mb, got: + $maxBufferSizeMb mb.")
  }
  ```

  We should mention "this value must be less than 2048 mb" on the configuration doc page as well.
  ## How was this patch tested?
  None, since it's a minor doc change.
  Author: Yuexin Zhang <yxzhang@cloudera.com>
  Closes #16412 from cnZach/SPARK-19006.
* [SPARK-18842][TESTS] De-duplicate paths in classpaths in processes for local-cluster mode in ReplSuite to work around the length limitation on Windows | hyukjinkwon | 2016-12-27 | 2 files, -2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | local-cluster mode in ReplSuite to work around the length limitation on Windows ## What changes were proposed in this pull request? `ReplSuite`s hang due to the length limitation on Windows with the exception as below: ``` Spark context available as 'sc' (master = local-cluster[1,1,1024], app id = app-20161223114000-0000). Spark session available as 'spark'. Exception in thread "ExecutorRunner for app-20161223114000-0000/26995" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOf(Arrays.java:3332) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622) at java.lang.StringBuilder.append(StringBuilder.java:202) at java.lang.ProcessImpl.createCommandLine(ProcessImpl.java:194) at java.lang.ProcessImpl.<init>(ProcessImpl.java:340) at java.lang.ProcessImpl.start(ProcessImpl.java:137) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167) at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73) ``` The reason is, it keeps failing and goes in an infinite loop. This fails because it uses the paths (via `getFile`) from URLs in the tests whereas some added afterward are normal local paths. (`url.getFile` gives `/C:/a/b/c` and some paths are added later as the format of `C:\a\b\c`. ) So, many classpaths are duplicated because normal local paths and paths from URLs are mixed. This length is up to 40K which hits the length limitation problem (32K) on Windows. The full command line built here is - https://gist.github.com/HyukjinKwon/46af7946c9a5fd4c6fc70a8a0aba1beb ## How was this patch tested? Manually via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/395-find-path-issues **After** https://ci.appveyor.com/project/spark-test/spark/build/398-find-path-issues Author: hyukjinkwon <gurwls223@gmail.com> Closes #16398 from HyukjinKwon/SPARK-18842-more.
* Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset"Yin Huai2016-12-271-42/+33
| | | | This reverts commit a05cc425a0a7d18570b99883993a04ad175aa071.
* [SPARK-18990][SQL] make DatasetBenchmark fairer for DatasetWenchen Fan2016-12-271-33/+42
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently `DatasetBenchmark` uses `case class Data(l: Long, s: String)` as the record type for both `RDD` and `Dataset`, which introduces serialization overhead only for `Dataset` and is unfair. This PR uses `Long` as the record type, to be fairer to `Dataset`. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16391 from cloud-fan/benchmark.
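For illustration, a hedged sketch of the fairness concern (names here are illustrative, not the benchmark's actual code): with a primitive `Long` record type, neither the RDD nor the Dataset side pays case-class encoding costs.

``` scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val n = 1000000L
val rdd = spark.sparkContext.range(0L, n)   // RDD[Long]
val ds  = spark.range(n).as[Long]           // Dataset[Long], no case-class encoder involved

val rddSum = rdd.map(_ + 1).reduce(_ + _)
val dsSum  = ds.map(_ + 1).reduce(_ + _)
```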
* [SPARK-19004][SQL] Fix `JDBCWriteSuite.testH2Dialect` by removing ↵Dongjoon Hyun2016-12-271-3/+0
| | | | | | | | | | | | | | | | | | `getCatalystType` ## What changes were proposed in this pull request? `JDBCSuite` and `JDBCWriterSuite` have their own `testH2Dialect`s for their testing purposes. This PR fixes `testH2Dialect` in `JDBCWriterSuite` by removing `getCatalystType` implementation in order to return correct types. Currently, it always returns `Some(StringType)` incorrectly. Note that, for the `testH2Dialect` in `JDBCSuite`, it's intentional because of the test case `Remap types via JdbcDialects`. ## How was this patch tested? This is a test only update. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16409 from dongjoon-hyun/SPARK-H2-DIALECT.
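For context, a hedged sketch of the kind of test dialect involved: a `getCatalystType` override that unconditionally returns `Some(StringType)` remaps every column, which is exactly what the write-side suite should not do (the dialect below is illustrative, not the suite's code).

``` scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

val overlyBroadH2Dialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  // Remapping every JDBC type to StringType hides the real column types;
  // removing this override lets the dialect report the correct types.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    Some(StringType)
}
```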
* [SPARK-18999][SQL][MINOR] simplify Literal codegenWenchen Fan2016-12-274-35/+18
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? `Literal` can use `CodegenContext.addReferenceObj` to implement codegen, instead of `CodegenFallback`. This also simplifies the generated code a little bit: before, we would generate `((Expression) references[1]).eval(null)`; now it's just `references[1]`. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #16402 from cloud-fan/minor.
* [SPARK-18989][SQL] DESC TABLE should not fail with format class not foundWenchen Fan2016-12-262-2/+55
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? When we describe a table, we only want to see the information about the table, not read it, so it's ok even if the format class is not present on the classpath. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #16388 from cloud-fan/hive.
* [SPARK-18980][SQL] implement Aggregator with TypedImperativeAggregateWenchen Fan2016-12-2610-56/+212
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently we implement `Aggregator` with `DeclarativeAggregate`, which serializes/deserializes the buffer object every time we process an input. This PR implements `Aggregator` with `TypedImperativeAggregate` and avoids serializing/deserializing the buffer object many times. The benchmark shows about a 2x speedup. For simple buffer objects that don't need serialization, we still go with `DeclarativeAggregate`, to avoid a performance regression. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #16383 from cloud-fan/aggregator.
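As a point of reference, a minimal user-defined `Aggregator` with a plain `Long` buffer (illustrative only); per the description above, such simple buffers keep the `DeclarativeAggregate` path, while aggregators with heavyweight buffer objects are the ones that benefit from `TypedImperativeAggregate`.

``` scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums Long inputs; the buffer is itself a Long, so no costly (de)serialization per row.
object SumLong extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, input: Long): Long = buffer + input
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```

It can then be applied to a `Dataset[Long]` with `ds.select(SumLong.toColumn)`.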
* [SPARK-17755][CORE] Use workerRef to send RegisterWorkerResponse to avoid ↵Shixiong Zhu2016-12-253-41/+32
| | | | | | | | | | | | | | | | | | the race condition ## What changes were proposed in this pull request? The root cause of this issue is that RegisterWorkerResponse and LaunchExecutor are sent via two different channels (TCP connections) and their order is not guaranteed. This PR changes the master and worker codes to use `workerRef` to send RegisterWorkerResponse, so that RegisterWorkerResponse and LaunchExecutor are sent via the same connection. Hence `LaunchExecutor` will always be after `RegisterWorkerResponse` and never be ignored. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16345 from zsxwing/SPARK-17755.
* [SPARK-18943][SQL] Avoid per-record type dispatch in CSV when readinghyukjinkwon2016-12-243-159/+184
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `CSVRelation.csvParser` does type dispatch for each value in each row. We can prevent this because the schema is already kept in `CSVRelation`. So, this PR proposes that converters are created first according to the schema, and then apply them to each. I just ran some small benchmarks as below after resembling the logics in https://github.com/apache/spark/blob/7c33b0fd050f3d2b08c1cfd7efbff8166832c1af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L170-L178 to test the updated logics. ```scala test("Benchmark for CSV converter") { var numMalformedRecords = 0 val N = 500 << 12 val schema = StructType( StructField("a", StringType) :: StructField("b", StringType) :: StructField("c", StringType) :: StructField("d", StringType) :: Nil) val row = Array("1.0", "test", "2015-08-20 14:57:00", "FALSE") val data = spark.sparkContext.parallelize(List.fill(N)(row)) val parser = CSVRelation.csvParser(schema, schema.fieldNames, CSVOptions()) val benchmark = new Benchmark("CSV converter", N) benchmark.addCase("cast CSV string tokens", 10) { _ => data.flatMap { recordTokens => parser(recordTokens, numMalformedRecords) }.collect() } benchmark.run() } ``` **Before** ``` CSV converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ cast CSV string tokens 1061 / 1130 1.9 517.9 1.0X ``` **After** ``` CSV converter: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ cast CSV string tokens 940 / 1011 2.2 459.2 1.0X ``` ## How was this patch tested? Tests in `CSVTypeCastSuite` and `CSVRelation` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16351 from HyukjinKwon/type-dispatch.
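A hedged sketch of the general pattern described above (illustrative, not the PR's code): build one converter per field from the schema up front, then apply the prebuilt converters to each record instead of dispatching on the data type for every value.

``` scala
import org.apache.spark.sql.types._

// One-time construction: a converter per field, chosen from the schema.
def makeConverter(dt: DataType): String => Any = dt match {
  case IntegerType   => (s: String) => s.toInt
  case DoubleType    => (s: String) => s.toDouble
  case BooleanType   => (s: String) => s.toBoolean
  case TimestampType => (s: String) => java.sql.Timestamp.valueOf(s)
  case _             => (s: String) => s // fall back to the raw string
}

def buildConverters(schema: StructType): Array[String => Any] =
  schema.fields.map(f => makeConverter(f.dataType))

// Per record: no per-value type dispatch, just apply the prebuilt functions.
def convertRecord(tokens: Array[String], converters: Array[String => Any]): Array[Any] =
  tokens.zip(converters).map { case (t, c) => c(t) }
```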
* [SPARK-18837][WEBUI] Very long stage descriptions do not wrap in the UIKousuke Saruta2016-12-243-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This issue was reported by wangyum. In the AllJobsPage, JobPage and StagePage, the description length was limited before like as follows. ![ui-2 0 0](https://cloud.githubusercontent.com/assets/4736016/21319673/8b225246-c651-11e6-9041-4fcdd04f4dec.gif) But recently, the limitation seems to have been accidentally removed. ![ui-2 1 0](https://cloud.githubusercontent.com/assets/4736016/21319825/104779f6-c652-11e6-8bfa-dfd800396352.gif) The cause is that some tables are no longer `sortable` class although they were, and `sortable` class does not only mark tables as sortable but also limited the width of their child `td` elements. The reason why now some tables are not `sortable` class is because another sortable mechanism was introduced by #13620 and #13708 with pagination feature. To fix this issue, I've introduced new class `table-cell-width-limited` which limits the description cell width and the description is like what it was. <img width="1260" alt="2016-12-20 1 00 34" src="https://cloud.githubusercontent.com/assets/4736016/21320478/89141c7a-c654-11e6-8494-f8f91325980b.png"> ## How was this patch tested? Tested manually with my browser. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #16338 from sarutak/SPARK-18837.
* [SPARK-18800][SQL] Correct the assert in UnsafeKVExternalSorter which ↵Liang-Chi Hsieh2016-12-241-1/+3
| | | | | | | | | | | | | | | | | | | | | | ensures array size ## What changes were proposed in this pull request? `UnsafeKVExternalSorter` uses `UnsafeInMemorySorter` to sort the records of `BytesToBytesMap` if it is given a map. Currently we use the number of keys in `BytesToBytesMap` to determine if the array used for sort is enough or not. We has an assert that ensures the size of the array is enough: `map.numKeys() <= map.getArray().size() / 2`. However, each record in the map takes two entries in the array, one is record pointer, another is key prefix. So the correct assert should be `map.numKeys() * 2 <= map.getArray().size() / 2`. ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16232 from viirya/SPARK-18800-fix-UnsafeKVExternalSorter.
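To make the arithmetic explicit (a sketch with hypothetical names, not the actual assert): each key occupies two entries in the pointer array, the record pointer plus the key prefix, and only half of the array is usable for these entries, hence the corrected condition.

``` scala
// Corrected invariant from the description above:
// numKeys * 2 entries must fit in half of the pointer array.
def pointerArrayHasRoom(numKeys: Long, arraySize: Long): Boolean =
  numKeys * 2 <= arraySize / 2
```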
* [SPARK-18911][SQL] Define CatalogStatistics to interact with metastore and ↵wangzhenhua2016-12-249-23/+95
| | | | | | | | | | | | | | | | | | | convert it to Statistics in relations ## What changes were proposed in this pull request? Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable doesn't have the concepts of attribute or broadcast hint in Statistics. Therefore, putting Statistics in CatalogTable is confusing. We define a different statistic structure in CatalogTable, which is only responsible for interacting with metastore, and is converted to statistics in LogicalPlan when it is used. ## How was this patch tested? add test cases Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #16323 from wzhfy/nameToAttr.
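A hedged sketch of the separation being described (class and field names below are illustrative, not the exact Spark definitions): a metastore-facing statistics holder carries only catalog-level values and is converted into plan-level statistics when a relation is built.

``` scala
// Metastore-facing: refers to columns only by name, knows nothing about plan attributes.
case class CatalogLevelStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// Plan-facing: what the optimizer consumes (the broadcast hint lives here, not in the catalog).
case class PlanLevelStats(sizeInBytes: BigInt, rowCount: Option[BigInt], isBroadcastable: Boolean)

def toPlanStats(cs: CatalogLevelStats): PlanLevelStats =
  PlanLevelStats(cs.sizeInBytes, cs.rowCount, isBroadcastable = false)
```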
* [SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use ↵Shixiong Zhu2016-12-231-6/+12
| | | | | | | | | | | | | | | | ConcurrentHashMap to make it faster ## What changes were proposed in this pull request? The time complexity of ConcurrentHashMap's `remove` is O(1). Changing ContextCleaner.referenceBuffer's type from `ConcurrentLinkedQueue` to `ConcurrentHashMap` will make the removal much faster. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16390 from zsxwing/SPARK-18991.
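For illustration, a set view backed by `ConcurrentHashMap` gives constant-time removal; the element type below is a hypothetical stand-in for the cleanup reference class.

``` scala
import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

final case class CleanupRef(id: Long) // hypothetical stand-in

// A concurrent Set with O(1) add/remove, versus O(n) remove on ConcurrentLinkedQueue.
val referenceBuffer: java.util.Set[CleanupRef] =
  Collections.newSetFromMap(new ConcurrentHashMap[CleanupRef, java.lang.Boolean])

referenceBuffer.add(CleanupRef(1L))
referenceBuffer.remove(CleanupRef(1L))
```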
* [SPARK-18963] o.a.s.unsafe.types.UTF8StringSuite.writeToOutputStreamIntArray ↵Pete Robbins2016-12-231-1/+1
| | | | | | | | | | | | | | | | | | test fails on big endian. Only change byte order on little endian ## What changes were proposed in this pull request? Fix test to only change byte order on LE platforms ## How was this patch tested? Test run on Big Endian and Little Endian platforms Author: Pete Robbins <robbinspg@gmail.com> Closes #16375 from robbinspg/SPARK-18963.
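A hedged sketch of the platform guard (not the suite's exact code): only reverse the byte order when the native order is little-endian, so the expected bytes match on both LE and BE machines.

``` scala
import java.nio.ByteOrder

// Reverse only on little-endian platforms; big-endian machines keep the bytes as-is.
def toExpectedOrder(bytes: Array[Byte]): Array[Byte] =
  if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) bytes.reverse else bytes
```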
* [SPARK-18958][SPARKR] R API toJSON on DataFrameFelix Cheung2016-12-223-18/+26
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? It makes it easier to integrate with other components expecting a row-based JSON format. This replaces the non-public toJSON RDD API. ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16368 from felixcheung/rJSON.
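The Scala-side equivalent of the row-based output this R API exposes looks like the following (illustrative data):

``` scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
// Each row becomes one JSON string, e.g. {"id":1,"name":"a"}
df.toJSON.show(false)
```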
* [SPARK-18972][CORE] Fix the netty thread names for RPCShixiong Zhu2016-12-225-9/+19
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Right now the names of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`, which is pretty confusing. This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes: - TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods. - Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements; otherwise, if the caller calls `next` without `hasNext`, it will just hang. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16380 from zsxwing/SPARK-18972.
* [SPARK-18985][SS] Add missing @InterfaceStability.Evolving for Structured ↵Shixiong Zhu2016-12-229-9/+28
| | | | | | | | | | | | | | | | Streaming APIs ## What changes were proposed in this pull request? Add missing InterfaceStability.Evolving for Structured Streaming APIs ## How was this patch tested? Compiling the codes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16385 from zsxwing/SPARK-18985.
* [SPARK-18537][WEB UI] Add a REST api to serve spark streaming informationsaturday_s2016-12-2224-5/+689
| | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is a continuation of #16000 and a completion of #15904. **Description** - Augment the `org.apache.spark.status.api.v1` package for serving streaming information. - Retrieve the streaming information through StreamingJobProgressListener. > this API should cover exactly the same amount of information as you can get from the web interface > the implementation is based on the current REST implementation of spark-core > and will be available for running applications only > > https://issues.apache.org/jira/browse/SPARK-18537 ## How was this patch tested? Local test. Author: saturday_s <shi.indetail@gmail.com> Author: Chan Chor Pang <ChorPang.Chan@access-company.com> Author: peterCPChan <universknight@gmail.com> Closes #16253 from saturday-shi/SPARK-18537.
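A hedged consumption sketch; the path below is an assumption modeled on the existing `/api/v1` layout rather than something confirmed by this commit, and the application id is a placeholder.

``` scala
import scala.io.Source

val appId = "app-00000000000000-0000" // placeholder application id
val url = s"http://localhost:4040/api/v1/applications/$appId/streaming/statistics" // assumed path
val json = Source.fromURL(url).mkString
println(json)
```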
* [SPARK-18975][CORE] Add an API to remove SparkListenerjerryshao2016-12-222-0/+25
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In current Spark we can add a customized SparkListener through the `SparkContext#addListener` API, but there's no equivalent API to remove a registered one. In our scenario a SparkListener is added repeatedly according to the changing environment. Without the ability to remove listeners, many registered listeners may accumulate, which is unnecessary and potentially affects performance. So this PR proposes to add an API to remove a registered listener. ## How was this patch tested? Add a unit test to verify it. Author: jerryshao <sshao@hortonworks.com> Closes #16382 from jerryshao/SPARK-18975.
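A short usage sketch of the add/remove pairing this PR introduces, assuming the new method is exposed as `SparkContext#removeSparkListener`; the listener body is illustrative.

``` scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

def register(sc: SparkContext): Unit = {
  val listener = new SparkListener {
    override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
      println(s"job ${jobEnd.jobId} finished")
  }
  sc.addSparkListener(listener)
  // ... later, when the environment changes, deregister instead of leaking listeners:
  sc.removeSparkListener(listener)
}
```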
* [SPARK-18973][SQL] Remove SortPartitions and RedistributeDataReynold Xin2016-12-228-58/+26
| | | | | | | | | | | | ## What changes were proposed in this pull request? SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with global flag false) that subsumed SortPartitions. ## How was this patch tested? Also updated test cases to reflect the removal. Author: Reynold Xin <rxin@databricks.com> Closes #16381 from rxin/SPARK-18973.
* [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data ↵hyukjinkwon2016-12-224-15/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sources implementing FileFormat ## What changes were proposed in this pull request? This PR cleans up duplicated checking of file paths in the implemented data sources and prevents attempting to list paths twice in the ORC data source. https://github.com/apache/spark/pull/14585 handles a problem with partition column names containing `_`, and that issue itself is resolved correctly. However, it seems the data sources implementing `FileFormat` are validating the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I guess we don't have to check this twice. Currently, this seems to be filtered in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`. So, `FileFormat.inferSchema` will always receive leaf files. For example, running the code below: ``` scala spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet") spark.read.parquet("/tmp/parquet") ``` gives the paths below, without directories but just valid data files: ``` bash /tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet /tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet /tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet ... ``` to `FileFormat.inferSchema`. ## How was this patch tested? Unit test added in `HadoopFsRelationTest` and related existing tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14627 from HyukjinKwon/SPARK-16975.
* [SPARK-18922][TESTS] Fix more resource-closing-related and path-related test ↵hyukjinkwon2016-12-225-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | failures in identified ones on Windows ## What changes were proposed in this pull request? There are several tests failing due to resource-closing-related and path-related problems on Windows as below. - `SQLQuerySuite`: ``` - specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) ``` - `JsonSuite`: ``` - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) ``` - `StateStoreSuite`: ``` - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0 at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:116) at org.apache.hadoop.fs.Path.<init>(Path.java:89) ... Cause: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0 at java.net.URI.checkPath(URI.java:1823) at java.net.URI.<init>(URI.java:745) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ``` - `HDFSMetadataLogSuite`: ``` - FileManager: FileContextManager *** FAILED *** (94 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127) at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38) - FileManager: FileSystemManager *** FAILED *** (78 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127) at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38) ``` And, there are some tests being failed due to the length limitation on cmd in Windows as below: - `LauncherBackendSuite`: ``` - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds) The code passed to eventually never returned normally. Attempted 283 times over 30.0960053 seconds. Last failure message: The reference was null. 
(LauncherBackendSuite.scala:56) org.scalatest.exceptions.TestFailedDueToTimeoutException: at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 milliseconds) The code passed to eventually never returned normally. Attempted 282 times over 30.037987100000002 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56) org.scalatest.exceptions.TestFailedDueToTimeoutException: at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) ``` The executed command is, https://gist.github.com/HyukjinKwon/d3fdd2e694e5c022992838a618a516bd, which is 16K length; however, the length limitation is 8K on Windows. So, it is being failed to launch. This PR proposes to fix the test failures on Windows and skip the tests failed due to the length limitation ## How was this patch tested? Manually tested via AppVeyor **Before** `SQLQuerySuite `: https://ci.appveyor.com/project/spark-test/spark/build/306-pr-references `JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/307-pr-references `StateStoreSuite` : https://ci.appveyor.com/project/spark-test/spark/build/305-pr-references `HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/304-pr-references `LauncherBackendSuite`: https://ci.appveyor.com/project/spark-test/spark/build/303-pr-references **After** `SQLQuerySuite`: https://ci.appveyor.com/project/spark-test/spark/build/293-SQLQuerySuite `JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/294-JsonSuite `StateStoreSuite`: https://ci.appveyor.com/project/spark-test/spark/build/297-StateStoreSuite `HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/319-pr-references `LauncherBackendSuite`: failed test skipped. Author: hyukjinkwon <gurwls223@gmail.com> Closes #16335 from HyukjinKwon/more-fixes-on-windows.
* [SPARK-18953][CORE][WEB UI] Do not show the link to a dead worker on the ↵Dongjoon Hyun2016-12-221-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | master page ## What changes were proposed in this pull request? For a dead worker, we will not be able to see its worker page anyway. This PR removes the links to dead workers from the master page. ## How was this patch tested? Since this is UI change, please do the following steps manually. **1. Start a master and a slave** ``` sbin/start-master.sh sbin/start-slave.sh spark://10.22.16.140:7077 ``` ![1_live_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373572/d7e187d6-c6d4-11e6-9110-f4371d215dec.png) **2. Stop the slave** ``` sbin/stop-slave.sh ``` ![2_dead_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373579/dd9e9704-c6d4-11e6-9047-a22cb0aa83ed.png) **3. Start a slave** ``` sbin/start-slave.sh spark://10.22.16.140:7077 ``` ![3_dead_worder_a_and_live_worker_b](https://cloud.githubusercontent.com/assets/9700541/21373582/e1b207f4-c6d4-11e6-89cb-6d8970175a5e.png) **4. Stop the slave** ``` sbin/stop-slave.sh ``` ![4_dead_worker_a_and_b](https://cloud.githubusercontent.com/assets/9700541/21373584/e5fecb4e-c6d4-11e6-95d3-49defe366946.png) **5. Driver list testing** Do the followings and stop the slave in a minute by `sbin/stop-slave.sh`. ``` sbin/start-master.sh sbin/start-slave.sh spark://10.22.16.140:7077 bin/spark-submit --master=spark://10.22.16.140:7077 --deploy-mode=cluster --class org.apache.spark.examples.SparkPi examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar 10000 ``` ![5_dead_worker_in_driver_list](https://cloud.githubusercontent.com/assets/9700541/21401320/be6cc9fc-c768-11e6-8de7-6512961296a5.png) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16366 from dongjoon-hyun/SPARK-18953.
* [DOC] bucketing is applicable to all file-based data sourcesReynold Xin2016-12-211-3/+3
| | | | | | | | | | | | ## What changes were proposed in this pull request? Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #16349 from rxin/ds-doc.
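For reference, a small example of the bucketing feature the docs now cover (table name and layout are illustrative):

``` scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.range(100).selectExpr("id", "id % 10 as key")
// Bucketed output requires saveAsTable; available for file-based sources since 2.1.0.
df.write
  .bucketBy(8, "key")
  .sortBy("key")
  .format("parquet")
  .saveAsTable("bucketed_example")
```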
* [SQL] Minor readability improvement for partition handling codeReynold Xin2016-12-224-44/+49
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch includes minor changes to improve the readability of the partition handling code. I'm in the middle of implementing a new feature and found some of the naming and implicit type inference not as intuitive as it could be. ## How was this patch tested? This patch should introduce no semantic change, and the changes should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #16378 from rxin/minor-fix.