Commit message | Author | Age | Files | Lines
* [SPARK-16625][SQL] General data types to be mapped to Oracle (Yuming Wang, 2016-08-05, 3 files, -17/+124)
## What changes were proposed in this pull request?
Spark converts **BooleanType** to **BIT(1)**, **LongType** to **BIGINT**, and **ByteType** to **BYTE** when saving a DataFrame to Oracle, but Oracle does not support the BIT, BIGINT and BYTE types. This PR converts the following _Spark types_ to _Oracle types_, following the [Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/):

Spark Type | Oracle
----|----
BooleanType | NUMBER(1)
IntegerType | NUMBER(10)
LongType | NUMBER(19)
FloatType | NUMBER(19, 4)
DoubleType | NUMBER(19, 4)
ByteType | NUMBER(3)
ShortType | NUMBER(5)

## How was this patch tested?
Added new tests in [JDBCSuite.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-dc4b58851b084b274df6fe6b189db84d) and [OracleDialect.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-5e0cadf526662f9281aa26315b3750ad)
Author: Yuming Wang <wgyumg@gmail.com>
Closes #14377 from wangyum/SPARK-16625.
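This kind of per-database type mapping is expressed through Spark's `JdbcDialect` API. The sketch below is illustrative only (the dialect object name is hypothetical; the actual change lives in the built-in `OracleDialect`), but it shows roughly how the table above translates into `getJDBCType`:
```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Illustrative dialect mapping Catalyst types to Oracle-compatible NUMBER types.
object OracleLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("NUMBER(1)", Types.NUMERIC))
    case IntegerType => Some(JdbcType("NUMBER(10)", Types.NUMERIC))
    case LongType    => Some(JdbcType("NUMBER(19)", Types.NUMERIC))
    case FloatType   => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case DoubleType  => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case ByteType    => Some(JdbcType("NUMBER(3)", Types.NUMERIC))
    case ShortType   => Some(JdbcType("NUMBER(5)", Types.NUMERIC))
    case _           => None
  }
}

// Custom dialects are registered globally before writing through JDBC.
JdbcDialects.registerDialect(OracleLikeDialect)
```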
* [MINOR] Update AccumulatorV2 doc to not mention "+=". (petermaxlee, 2016-08-05, 1 file, -1/+1)
## What changes were proposed in this pull request?
As reported by Bryan Cutler on the mailing list, AccumulatorV2 does not have a += method, yet the documentation still references it.
## How was this patch tested?
N/A
Author: petermaxlee <petermaxlee@gmail.com>
Closes #14466 from petermaxlee/accumulator.
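As a reminder of what the corrected documentation should point readers to, AccumulatorV2 values are updated with `add()`; a minimal sketch (app and accumulator names are illustrative):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("acc-demo").master("local[*]").getOrCreate()
val counter = spark.sparkContext.longAccumulator("evenNumbers")

spark.sparkContext.parallelize(1 to 100).foreach { i =>
  if (i % 2 == 0) counter.add(1L)  // AccumulatorV2 exposes add(), not +=
}
println(counter.value)  // 50, read back on the driver
```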
* [SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration (cody koeninger, 2016-08-05, 4 files, -207/+452)
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration.
## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.
Author: cody koeninger <cody@koeninger.org>
Closes #14385 from koeninger/SPARK-16312.
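The new guide covers the 0.10 direct-stream consumer API; a condensed sketch of the pattern it documents (broker address, group id and topic names here are placeholders):
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-0-10-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder connection settings; the new consumer API takes Kafka 0.10's own config keys.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("topicA"), kafkaParams))

// Each element is a Kafka ConsumerRecord; here we just print key/value pairs.
stream.map(record => (record.key, record.value)).print()

ssc.start()
ssc.awaitTermination()
```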
* [SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS (Wenchen Fan, 2016-08-05, 17 files, -372/+417)
## What changes were proposed in this pull request?
We have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14482 from cloud-fan/table.
* [SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD (Hiroshi Inoue, 2016-08-05, 1 file, -25/+25)
## What changes were proposed in this pull request?
DatasetBenchmark compares the performances of RDD, DataFrame and Dataset while running the same operations. However, there are two problems that make the comparisons unfair.
1) In backToBackMap test case, only DataFrame implementation executes less work compared to RDD or Dataset implementations. This test case processes Long+String pairs, but the output from the DataFrame implementation does not include String part while RDD or Dataset generates Long+String pairs as output. This difference significantly changes the performance characteristics due to the String manipulation and creation overheads.
2) In back-to-back map and back-to-back filter test cases, `map` or `filter` operation is executed only once regardless of `numChains` parameter for RDD. Hence the execution times for RDD have been largely underestimated.
Of course, these issues do not affect Spark users, but it may confuse Spark developers.
## How was this patch tested?
By executing the DatasetBenchmark
Author: Hiroshi Inoue <inouehrs@jp.ibm.com>
Closes #13459 from inouehrs/fix_benchmark_fairness.
* [SPARK-16907][SQL] Fix performance regression for parquet table when vectorized parquet record reader is not being used (Sean Zhong, 2016-08-05, 1 file, -1/+7)
## What changes were proposed in this pull request?
For non-partitioned parquet table, if the vectorized parquet record reader is not being used, Spark 2.0 adds an extra unnecessary memory copy to append partition values for each row. There are several typical cases that vectorized parquet record reader is not being used:
1. When the table schema is not flat, like containing nested fields.
2. When `spark.sql.parquet.enableVectorizedReader = false`
By fixing this bug, we get about 20% - 30% performance gain in test case like this:
```
// Generates parquet table with nested columns
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
  (t1 - t0)/1000000
}

val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20
```
## How was this patch tested?
After a few times warm up, we get 26% performance improvement.
Before fix:
```
Average: 4584ms, raw data (10 tries): 4726ms 4509ms 4454ms 4879ms 4586ms 4733ms 4500ms 4361ms 4456ms 4640ms
```
After fix:
```
Average: 3614ms, raw data(10 tries): 3554ms 3740ms 4019ms 3439ms 3460ms 3664ms 3557ms 3584ms 3612ms 3531ms
```
Test env: Intel(R) Core(TM) i7-6700 CPU 3.40GHz, Intel SSD SC2KW24
Author: Sean Zhong <seanzhong@databricks.com>
Closes #14445 from clockfly/fix_parquet_regression_2.
* MAINTENANCE. Cleaning up stale PRs. (Marcelo Vanzin, 2016-08-04, 0 files, -0/+0)
Closing the following PRs due to requests or unresponsive users.
Closes #13923
Closes #14462
Closes #13123
Closes #14423 (requested by srowen)
Closes #14424 (requested by srowen)
Closes #14101 (requested by jkbradley)
Closes #10676 (requested by srowen)
Closes #10943 (requested by yhuai)
Closes #9936
Closes #10701
Closes #10474
Closes #13248
Closes #14347
Closes #10356
Closes #9866
Closes #14310 (requested by srowen)
Closes #14390 (requested by srowen)
Closes #14343 (requested by srowen)
Closes #14402 (requested by srowen)
Closes #14437 (requested by srowen)
Closes #12000 (already merged)
* [HOTFIX] Remove unnecessary imports from #12944 that broke build (Josh Rosen, 2016-08-04, 1 file, -5/+0)
Author: Josh Rosen <joshrosen@databricks.com>
Closes #14499 from JoshRosen/hotfix.
* [SPARK-15074][SHUFFLE] Cache shuffle index file to speed up shuffle fetch (Sital Kedia, 2016-08-04, 5 files, -12/+138)
## What changes were proposed in this pull request?
Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens/closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch.
## How was this patch tested?
Tested by running a job on the cluster; the shuffle read time was reduced by 50%.
Author: Sital Kedia <skedia@fb.com>
Closes #12944 from sitalkedia/shuffle_service.
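The caching pattern described here can be sketched with a bounded Guava `LoadingCache` keyed by index file. This is an illustrative approximation under stated assumptions, not the shuffle service's actual code (the class and method names below are hypothetical):
```scala
import java.io.{BufferedInputStream, DataInputStream, File, FileInputStream}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Parsed contents of a shuffle index file: the block offsets it stores.
case class ShuffleIndexInfo(offsets: Array[Long])

// Read the whole index file once; a shuffle index file is a sequence of 8-byte offsets.
def loadIndex(file: File): ShuffleIndexInfo = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
  try ShuffleIndexInfo(Array.fill((file.length() / 8).toInt)(in.readLong()))
  finally in.close()
}

// Bounded cache so repeated fetches from the same map output reuse the parsed index
// instead of reopening the file for every block.
val indexCache: LoadingCache[File, ShuffleIndexInfo] =
  CacheBuilder.newBuilder()
    .maximumSize(1024)
    .build(new CacheLoader[File, ShuffleIndexInfo] {
      override def load(file: File): ShuffleIndexInfo = loadIndex(file)
    })

// On a block fetch: val offsets = indexCache.get(indexFile).offsets
```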
* [SPARK-16863][ML] ProbabilisticClassifier.fit check thresholds' length (Zheng RuiFeng, 2016-08-04, 4 files, -0/+28)
## What changes were proposed in this pull request?
Add thresholds' length checking for classifiers which extend ProbabilisticClassifier.
## How was this patch tested?
Unit tests and manual tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
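From the user side, the effect of the added check is roughly the following (estimator and values are illustrative): a thresholds array whose length does not match the number of classes is now rejected at `fit()` time rather than surfacing as a confusing failure later.
```scala
import org.apache.spark.ml.classification.LogisticRegression

// Two thresholds imply a binary problem; fitting this estimator on a
// three-class dataset should now fail fast with a clear length-check error.
val lr = new LogisticRegression().setThresholds(Array(0.4, 0.6))

// val model = lr.fit(threeClassTrainingData)  // hypothetical dataset; expected to throw
```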
* [SPARK-16877][BUILD] Add rules for preventing use of Java annotations (Deprecated and Override) (hyukjinkwon, 2016-08-04, 1 file, -0/+8)
## What changes were proposed in this pull request?
This PR adds rules for preventing the use of both `Deprecated` and `Override`.
- Java's `Override`: it seems the Scala compiler just ignores this. Apparently, the `override` modifier is only mandatory for members "that override some other **concrete member definition** in a parent class" but not for an **incomplete member definition** (such as one from a trait or abstract class), see (http://www.scala-lang.org/files/archive/spec/2.11/05-classes-and-objects.html#override).
For a simple example,
- Normal class - needs `override` modifier
```bash
scala> class A { def say = {}}
defined class A

scala> class B extends A { def say = {}}
<console>:8: error: overriding method say in class A of type => Unit;
 method say needs `override' modifier
       class B extends A { def say = {}}
                               ^
```
- Trait - does not need `override` modifier
```bash
scala> trait A { def say }
defined trait A

scala> class B extends A { def say = {}}
defined class B
```
To cut this short, this case below is possible,
```bash
scala> class B extends A {
     | Override
     | def say = {}
     | }
defined class B
```
so we can write an `Override` annotation (meaning nothing), which might confuse engineers into thinking Java's annotation is working fine. It might be great if we prevent that potential confusion.
- Java's `Deprecated`: when `Deprecated` is used, the Scala compiler seems to recognise it correctly, but we use Scala's `deprecated` across the codebase.
## How was this patch tested?
Manually tested, by inserting both `Override` and `Deprecated`. This shows the error messages as below:
```bash
Scalastyle checks failed at following occurrences:
[error] ... : deprecated should be used instead of java.lang.Deprecated.
```
```bash
Scalastyle checks failed at following occurrences:
[error] ... : override modifier should be used instead of java.lang.Override.
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14490 from HyukjinKwon/SPARK-16877.
* [SPARK-16880][ML][MLLIB] make ANN training data persisted if needed (WeichenXu, 2016-08-04, 1 file, -2/+7)
## What changes were proposed in this pull request?
Make sure the ANN layer's input training data is persisted, so that we avoid the overhead of recomputing the RDD from its lineage.
## How was this patch tested?
Existing tests.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14483 from WeichenXu123/add_ann_persist_training_data.
* [SPARK-16875][SQL] Add args checking for DataSet randomSplit and sample (Zheng RuiFeng, 2016-08-04, 2 files, -14/+37)
## What changes were proposed in this pull request?
Add the missing args-checking for randomSplit and sample.
## How was this patch tested?
Unit tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #14478 from zhengruifeng/fix_randomSplit.
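A small usage sketch of the two APIs whose arguments are now validated (values are arbitrary); invalid weights or a negative fraction are expected to be rejected up front by the added checks:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("args-check-demo").master("local[*]").getOrCreate()
val ds = spark.range(0, 100)

// Valid: non-negative weights (normalized internally) and a non-negative fraction.
val Array(train, test) = ds.randomSplit(Array(0.8, 0.2), seed = 42L)
val sampled = ds.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// With this change, calls like ds.randomSplit(Array(-1.0, 2.0)) or ds.sample(false, -0.5)
// should fail immediately instead of misbehaving downstream.
println(s"train=${train.count()}, test=${test.count()}, sampled=${sampled.count()}")
```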
* [SPARK-16884] Move DataSourceScanExec out of ExistingRDD.scala file (Eric Liang, 2016-08-04, 3 files, -504/+525)
## What changes were proposed in this pull request?
This moves DataSourceScanExec out so it's more discoverable, now that it doesn't necessarily depend on an existing RDD. cc davies
## How was this patch tested?
Existing tests.
Author: Eric Liang <ekl@databricks.com>
Closes #14487 from ericl/split-scan.
* [SPARK-16802] [SQL] fix overflow in LongToUnsafeRowMap (Davies Liu, 2016-08-04, 2 files, -6/+55)
## What changes were proposed in this pull request?
This patch fixes the overflow in LongToUnsafeRowMap when the range of keys is very wide (the key is much smaller than minKey, for example, key is Long.MinValue, minKey is > 0).
## How was this patch tested?
Added regression test (also for SPARK-16740).
Author: Davies Liu <davies@databricks.com>
Closes #14464 from davies/fix_overflow.
* [SPARK-16853][SQL] fixes encoder error in DataSet typed select (Sean Zhong, 2016-08-04, 4 files, -10/+29)
## What changes were proposed in this pull request?
For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.
### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```
### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #14474 from clockfly/SPARK-16853.
* [SPARK-16867][SQL] createTable and alterTable in ExternalCatalog should not take db (Wenchen Fan, 2016-08-04, 6 files, -31/+28)
## What changes were proposed in this pull request?
These 2 methods take `CatalogTable` as parameter, which already has the database information.
## How was this patch tested?
Existing test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14476 from cloud-fan/minor5.
* [SPARK-16888][SQL] Implements eval method for expression AssertNotNull (Sean Zhong, 2016-08-04, 2 files, -7/+21)
## What changes were proposed in this pull request?
Implements `eval()` method for expression `AssertNotNull` so that we can convert local projection on LocalRelation to another LocalRelation.
### Before change:
```
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain
java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
  at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
  ...
```
### After the change:
```
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain(true)
== Parsed Logical Plan ==
'Project [assertnotnull('_1) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Analyzed Logical Plan ==
assertnotnull(_1): struct<a:int>
Project [assertnotnull(_1#2) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Optimized Logical Plan ==
LocalRelation [assertnotnull(_1)#5]

== Physical Plan ==
LocalTableScan [assertnotnull(_1)#5]
```
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes #14486 from clockfly/assertnotnull_eval.
* [MINOR][SQL] Fix minor formatting issue of SortAggregateExec.toString (Cheng Lian, 2016-08-04, 1 file, -3/+3)
## What changes were proposed in this pull request?
This PR fixes a minor formatting issue (missing space after comma) of `SortAggregateExec.toString`.
Before:
```
SortAggregate(key=[a#76,b#77], functions=[max(c#78),min(c#78)], output=[a#76,b#77,max(c)#89,min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76,b#77], functions=[partial_max(c#78),partial_min(c#78)], output=[a#76,b#77,max#99,min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```
After:
```
SortAggregate(key=[a#76, b#77], functions=[max(c#78), min(c#78)], output=[a#76, b#77, max(c)#89, min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76, b#77], functions=[partial_max(c#78), partial_min(c#78)], output=[a#76, b#77, max#99, min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes #14480 from liancheng/fix-sort-based-agg-string-format.
* [SPARK-16873][CORE] Fix SpillReader NPE when spillFile has no data (sharkd, 2016-08-03, 1 file, -1/+3)
## What changes were proposed in this pull request?
SpillReader NPE when spillFile has no data. See the following logs:
```
16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B
16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter3db4b52d
16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 23585
16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585)
java.lang.NullPointerException
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539)
    at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507)
    at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816)
    at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251)
    at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793)
16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
```
## How was this patch tested?
Manual test.
Author: sharkd <sharkd.tu@gmail.com>
Author: sharkdtu <sharkdtu@tencent.com>
Closes #14479 from sharkdtu/master.
* [SPARK-16814][SQL] Fix deprecated parquet constructor usage (Holden Karau, 2016-08-03, 3 files, -6/+29)
## What changes were proposed in this pull request?
Replace deprecated ParquetWriter with the new builders.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes #14419 from holdenk/SPARK-16814-fix-deprecated-parquet-constructor-usage.
* [SPARK-16770][BUILD] Fix JLine dependency management and version (Sca… (Stefan Schulze, 2016-08-03, 7 files, -23/+20)
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer an org.scala-lang:jline version aligned to the Scala version itself. The Scala console now uses the plain jline:jline module. Spark's dependency management did not reflect this change properly, causing Maven to pull in JLine via transitive dependency. Unfortunately JLine 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with a German keyboard layout. This request contains the following changes:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to module 'spark-repl'
## How was this patch tested?
- Running mvn dependency:tree and checking for correct JLine version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball
Author: Stefan Schulze <stefan.schulze@pentasys.de>
Closes #14429 from stsc-pentasys/SPARK-16770.
* [SPARK-14204][SQL] register driverClass rather than user-specified class (Kevin McHale, 2016-08-03, 1 file, -1/+1)
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well. srowen zzcclp JoshRosen
This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204
My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.
This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.
Author: Kevin McHale <kevin@premise.com>
Closes #14420 from mchalek/mchalek-jdbc_driver_registration.
* [SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time (Eric Liang, 2016-08-03, 9 files, -291/+356)
## What changes were proposed in this pull request?
Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.
This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.
TODO: In another PR, move DataSourceScanExec to its own file.
## How was this patch tested?
Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).
Author: Eric Liang <ekl@databricks.com>
Closes #14241 from ericl/refactor.
* [SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type (Wenchen Fan, 2016-08-03, 3 files, -18/+97)
## What changes were proposed in this pull request?
Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

| |Hive|MySQL|Postgres|
|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours make sense and Spark SQL should follow them.
This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by adding a `findWiderTypeWithoutStringPromotion`, which provides similar semantics to `findWiderCommonType`, but without string promotion.
## How was this patch tested?
New tests in `TypeCoercionSuite`.
Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #14439 from cloud-fan/bug.
* [SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics (=^_^=, 2016-08-03, 1 file, -1/+3)
## What changes were proposed in this pull request?
avgMetrics was summed, not averaged, across folds.
Author: =^_^= <maxmoroz@gmail.com>
Closes #14456 from pkch/pkch-patch-1.
* [SQL][MINOR] use stricter type parameter to make it clear that parquet reader returns UnsafeRow (Wenchen Fan, 2016-08-03, 3 files, -10/+10)
## What changes were proposed in this pull request?
A small code style change; it's better to make the type parameter more accurate.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14458 from cloud-fan/parquet.
* [SPARK-16796][WEB UI] Visible passwords on Spark environment page (Artur Sukhenko, 2016-08-02, 1 file, -1/+5)
## What changes were proposed in this pull request?
Mask spark.ssl.keyPassword, spark.ssl.keyStorePassword, spark.ssl.trustStorePassword in the Web UI environment page (changes their values to ***** on the env. page).
## How was this patch tested?
I've built Spark, run the Spark shell and checked that these values have been masked with *****.
Also ran tests: ./dev/run-tests
[info] ScalaTest
[info] Run completed in 1 hour, 9 minutes, 5 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.
![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png)
Author: Artur Sukhenko <artur.sukhenko@gmail.com>
Closes #14409 from Devian-ua/maskpass.
* [SPARK-16858][SQL][TEST] Removal of TestHiveSharedState (gatorsmile, 2016-08-02, 2 files, -60/+20)
### What changes were proposed in this pull request?
This PR is to remove `TestHiveSharedState`. It is also associated with the Hive refactoring for removing `HiveSharedState`.
### How was this patch tested?
The existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #14463 from gatorsmile/removeTestHiveSharedState.
* [SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file (Josh Rosen, 2016-08-02, 4 files, -26/+78)
## What changes were proposed in this pull request?
The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.
Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions. In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration. This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.
I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed.
At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile` it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older version)). To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above.
As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task launching thread and written by a driver thread in case the user's driver code is multi-threaded.
## How was this patch tested?
I added regression tests in `SparkContextSuite`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #14396 from JoshRosen/SPARK-16787.
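In terms of user-visible behaviour, the change means a pattern like the following (paths and names are placeholders) no longer throws on the second call, and executors keep resolving the file through `SparkFiles`:
```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("addfile-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

sc.addFile("/tmp/lookup.txt")
sc.addFile("/tmp/lookup.txt")  // same name, same contents: now a successful no-op (timestamp not updated)

sc.parallelize(1 to 4).foreach { _ =>
  val localPath = SparkFiles.get("lookup.txt")  // each executor resolves its local copy
  // ... read the file as needed ...
}
```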
* [SPARK-16855][SQL] move Greatest and Least from conditionalExpressions.scala to arithmetic.scala (Wenchen Fan, 2016-08-02, 4 files, -229/+228)
## What changes were proposed in this pull request?
`Greatest` and `Least` are not conditional expressions, but arithmetic expressions.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14460 from cloud-fan/move.
* [SPARK-16816] Modify Java example which is also reflected in documentation examples (sandy, 2016-08-02, 1 file, -0/+16)
## What changes were proposed in this pull request?
Modify the Java example, which is also reflected in the documentation.
## How was this patch tested?
Ran test cases.
Author: sandy <phalodi@gmail.com>
Closes #14436 from phalodi/SPARK-16816.
* [SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals (Herman van Hovell, 2016-08-02, 4 files, -2/+32)
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select
  /* Spark 1.6: */ current_date,
  /* Spark 1.6 & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.
## How was this patch tested?
Added a case to ExpressionParserSuite.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #14442 from hvanhovell/SPARK-16836.
* [SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs (Liang-Chi Hsieh, 2016-08-02, 3 files, -3/+58)
## What changes were proposed in this pull request?
There are two related bugs of Python-only UDTs. Because the test case of the second one needs the first fix too, I put them into one PR. If it is not appropriate, please let me know.
### First bug: When MapObjects works on Python-only UDTs
`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes error like:

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
    df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
    df.show()

    File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
    : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
    ...

### Second bug: When Python-only UDTs is the element type of ArrayType

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
    df.show()

## How was this patch tested?
PySpark's sql tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #13778 from viirya/fix-pyudt.
* [SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors (Tom Magrino, 2016-08-02, 2 files, -2/+14)
## What changes were proposed in this pull request?
Fix of incorrect arguments (dropping slideDuration and using windowDuration) in constructors for TimeWindow.
The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837
## How was this patch tested?
Added a test to TimeWindowSuite to check that the results of TimeWindow object apply and TimeWindow class constructor are equivalent.
Author: Tom Magrino <tmagrino@fb.com>
Closes #14441 from tmagrino/windowing-fix.
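The affected constructors back the public `window` function; a usage sketch where the slide duration matters (column names, timestamps and durations are illustrative), which is the case the fix ensures is honoured:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, window}

val spark = SparkSession.builder().appName("window-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(
  ("2016-08-02 10:01:00", "a"),
  ("2016-08-02 10:07:00", "b"),
  ("2016-08-02 10:12:00", "a")
).toDF("time", "id").selectExpr("cast(time as timestamp) as time", "id")

// 10-minute windows sliding every 5 minutes: the slide duration must not be silently
// replaced by the window duration, which is what SPARK-16837 fixes.
events.groupBy(window($"time", "10 minutes", "5 minutes"), $"id")
  .agg(count("*").as("n"))
  .show(truncate = false)
```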
* [SPARK-16822][DOC] Support latex in scaladoc. (Shuai Lin, 2016-08-02, 7 files, -117/+225)
## What changes were proposed in this pull request?
Support using latex in scaladoc by adding MathJax javascript to the js template.
## How was this patch tested?
Generated scaladoc. Preview:
- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)
- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)
Author: Shuai Lin <linshuai2012@gmail.com>
Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
* [SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (master branch) (Maciej Brynski, 2016-08-02, 1 file, -3/+5)
## What changes were proposed in this pull request?
Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7.
## How was this patch tested?
Compilation. Existing automatic tests.
Author: Maciej Brynski <maciej.brynski@adpilot.pl>
Closes #14459 from maver1ck/spark-15541-master.
* [SPARK-16558][EXAMPLES][MLLIB] examples/mllib/LDAExample should use MLVector instead of MLlib Vector (Xusen Yin, 2016-08-02, 1 file, -2/+3)
## What changes were proposed in this pull request?
mllib.LDAExample uses ML pipeline and MLlib LDA algorithm. The former transforms original data into MLVector format, while the latter uses MLlibVector format.
## How was this patch tested?
Test manually.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #14212 from yinxusen/SPARK-16558.
* [SPARK-16851][ML] Incorrect threshold length in 'setThresholds()' evokes Exception (Zheng RuiFeng, 2016-08-02, 1 file, -1/+6)
## What changes were proposed in this pull request?
Add a length check for thresholds in method `setThresholds()` of classification models.
## How was this patch tested?
Unit tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #14457 from zhengruifeng/check_setThresholds.
* [SPARK-16850][SQL] Improve type checking error message for greatest/least (petermaxlee, 2016-08-02, 2 files, -2/+15)
## What changes were proposed in this pull request?
Greatest/least function does not have the most friendly error message for data types. This patch improves the error message to not show the Seq type, and use more human readable data types.
Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```
After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```
## How was this patch tested?
Manually verified the output and also added unit tests to ConditionalExpressionSuite.
Author: petermaxlee <petermaxlee@gmail.com>
Closes #14453 from petermaxlee/SPARK-16850.
* [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings (Cheng Lian, 2016-08-02, 9 files, -117/+137)
## What changes were proposed in this pull request?
This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes #14368 from liancheng/revise-examples.
* [SPARK-16793][SQL] Set the temporary warehouse path to sc's conf in TestHive. (jiangxingbo, 2016-08-01, 3 files, -25/+21)
## What changes were proposed in this pull request?
With SPARK-15034, we could use the value of spark.sql.warehouse.dir to set the warehouse location. In TestHive, we can now simply set the temporary warehouse path in sc's conf, and thus the param "warehousePath" could be removed.
## How was this patch tested?
Existing test suites.
Author: jiangxingbo <jiangxingbo@meituan.com>
Closes #14401 from jiangxb1987/warehousePath.
* [SPARK-16828][SQL] remove MaxOf and MinOf (Wenchen Fan, 2016-08-01, 4 files, -175/+0)
## What changes were proposed in this pull request?
These 2 expressions are not needed anymore after we have `Greatest` and `Least`. This PR removes them and related tests.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #14434 from cloud-fan/minor1.
* [SPARK-15869][STREAMING] Fix a potential NPE in StreamingJobProgressListener.getBatchUIData (Shixiong Zhu, 2016-08-01, 1 file, -1/+1)
## What changes were proposed in this pull request?
Moved `asScala` to a `map` to avoid NPE.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #14443 from zsxwing/SPARK-15869.
* [SPARK-16774][SQL] Fix use of deprecated timestamp constructor & improve timezone handling (Holden Karau, 2016-08-01, 2 files, -7/+10)
## What changes were proposed in this pull request?
Removes the deprecated timestamp constructor and incidentally fixes the usage which was using the system timezone rather than the one specified when working near DST.
This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries where it didn't before.
Note: this is only a partial solution; longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem & simplify our timezone handling code.
## How was this patch tested?
New tests for two timezones added, so even if the user timezone happens to coincide with one, the other tests should still fail. Important note: this (temporarily) disables the round trip tests until we can fix the issue more thoroughly.
Author: Holden Karau <holden@us.ibm.com>
Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.
* [SPARK-16791][SQL] cast struct with timestamp field fails (eyal farago, 2016-08-01, 2 files, -1/+11)
## What changes were proposed in this pull request?
A failing test case + fix for SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791).
## How was this patch tested?
Added a failing test case to CastSuite, then fixed the Cast code and reran the entire CastSuite.
Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.farago@actimize.com>
Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.
* [SPARK-16776][STREAMING] Replace deprecated API in KafkaTestUtils for 0.10.0. (hyukjinkwon, 2016-08-01, 1 file, -8/+12)
## What changes were proposed in this pull request?
This PR replaces the old Kafka API with the 0.10.0 one in `KafkaTestUtils`. The changes include:
- `Producer` to `KafkaProducer`
- Change configurations to equivalent ones. (I referred [here](http://kafka.apache.org/documentation.html#producerconfigs) for 0.10.0 and [here](http://kafka.apache.org/082/documentation.html#producerconfigs ) for old, 0.8.2).
This PR will remove the build warnings below:
```scala
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:71: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   private var producer: Producer[String, String] = _
[WARNING]   ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:181: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]   ^
[WARNING] .../spark/streaming/kafka010/KafkaTestUtils.scala:181: class ProducerConfig in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerConfig instead.
[WARNING]   producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]   ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:182: class KeyedMessage in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerRecord instead.
[WARNING]   producer.send(messages.map { new KeyedMessage[String, String](topic, _ ) }: _*)
[WARNING]   ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
```
## How was this patch tested?
Existing tests that use `KafkaTestUtils` should cover this.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14416 from HyukjinKwon/SPARK-16776.
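For reference, the non-deprecated producer API the test utility was moved to looks roughly like this (broker address, topic and message values are placeholders):
```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")

// org.apache.kafka.clients.producer.KafkaProducer replaces the deprecated kafka.producer.Producer;
// messages are wrapped in ProducerRecord instead of KeyedMessage.
val producer = new KafkaProducer[String, String](props)
Seq("m1", "m2").foreach { m =>
  producer.send(new ProducerRecord[String, String]("test-topic", m))
}
producer.close()
```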
* [SPARK-16778][SQL][TRIVIAL] Fix deprecation warning with SQLContext (Holden Karau, 2016-08-01, 1 file, -1/+1)
## What changes were proposed in this pull request?
Change to non-deprecated constructor for SQLContext.
## How was this patch tested?
Existing tests.
Author: Holden Karau <holden@us.ibm.com>
Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.
* [SPARK-16485][DOC][ML] Remove useless latex in a log message. (Shuai Lin, 2016-08-01, 1 file, -1/+1)
## What changes were proposed in this pull request?
Removed useless latex in a log message.
## How was this patch tested?
Checked generated scaladoc.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes #14380 from lins05/fix-docs-formatting.
* [SPARK-16726][SQL] Improve `Union/Intersect/Except` error messages on incompatible types (Dongjoon Hyun, 2016-08-01, 2 files, -13/+46)
## What changes were proposed in this pull request?
Currently, `UNION` queries on incompatible types show misleading error messages, i.e., `unresolved operator Union`. We had better show a more correct message. This will help users in the situation of [SPARK-16704](https://issues.apache.org/jira/browse/SPARK-16704).
**Before**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;

scala> sql("select 1,2,3 except (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
```
**After**
```scala
scala> sql("select 1,2,3 union (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;

scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the second column of the second table;

scala> sql("select 1,2,3 except (select array(1),array(2),3)")
org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. ArrayType(IntegerType,false) <> IntegerType at the first column of the second table;
```
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14355 from dongjoon-hyun/SPARK-16726.