aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-17473][SQL] fixing docker integration tests error due to different ↵sureshthalamati2016-09-192-69/+0
| | | | | | | | | | | | | | | versions of jars. ## What changes were proposed in this pull request? Docker tests are using older version of jersey jars (1.19), which was used in older releases of spark. In 2.0 releases Spark was upgraded to use 2.x verison of Jersey. After upgrade to new versions, docker tests are failing with AbstractMethodError. Now that spark is upgraded to 2.x jersey version, using of shaded docker jars may not be required any more. Removed the exclusions/overrides of jersey related classes from pom file, and changed the docker-client to use regular jar instead of shaded one. ## How was this patch tested? Tested using existing docker-integration-tests Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
* [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not ↵Sean Owen2016-09-192-6/+17
| | | | | | | | | | | | | | | | relative to a calendar ## What changes were proposed in this pull request? Clarify that slide and window duration are absolute, and not relative to a calendar. ## How was this patch tested? Doc build (no functional change) Author: Sean Owen <sowen@cloudera.com> Closes #15142 from srowen/SPARK-17297.
* [SPARK-17571][SQL] AssertOnQuery.condition should always return Boolean valuepetermaxlee2016-09-183-4/+10
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? AssertOnQuery has two apply constructor: one that accepts a closure that returns boolean, and another that accepts a closure that returns Unit. This is actually very confusing because developers could mistakenly think that AssertOnQuery always require a boolean return type and verifies the return result, when indeed the value of the last statement is ignored in one of the constructors. This pull request makes the two constructor consistent and always require boolean value. It will overall make the test suites more robust against developer errors. As an evidence for the confusing behavior, this change also identified a bug with an existing test case due to file system time granularity. This pull request fixes that test case as well. ## How was this patch tested? This is a test only change. Author: petermaxlee <petermaxlee@gmail.com> Closes #15127 from petermaxlee/SPARK-17571.
* [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properlyLiwei Lin2016-09-187-83/+93
| | | | | | | | | | | | | | | | | | | | ## Problem CSV in Spark 2.0.0: - does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression comparing to 1.6; - does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903. ## What changes were proposed in this pull request? This patch makes changes to read all empty values back as `null`s. ## How was this patch tested? New test cases. Author: Liwei Lin <lwlin7@gmail.com> Closes #14118 from lw-lin/csv-cast-null.
* [SPARK-17586][BUILD] Do not call static member via instance referencehyukjinkwon2016-09-181-1/+1
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes a warning message as below: ``` [WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method should be qualified by type name, TaskMemoryManager, instead of by an expression [WARNING] currentPageNumber = memoryManager.decodePageNumber(recordPointer) ``` by referencing the static member via class not instance reference. ## How was this patch tested? Existing tests should cover this - Jenkins tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15141 from HyukjinKwon/SPARK-17586.
* [SPARK-17546][DEPLOY] start-* scripts should use hostname -fSean Owen2016-09-183-3/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Call `hostname -f` to get fully qualified host name ## How was this patch tested? Jenkins tests of course, but also verified output of command on OS X and Linux Author: Sean Owen <sowen@cloudera.com> Closes #15129 from srowen/SPARK-17546.
* [SPARK-17506][SQL] Improve the check double values equality rule.jiangxingbo2016-09-182-7/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In `ExpressionEvalHelper`, we check the equality between two double values by comparing whether the expected value is within the range [target - tolerance, target + tolerance], but this can cause a negative false when the compared numerics are very large. Before: ``` val1 = 1.6358558070241E306 val2 = 1.6358558070240974E306 ExpressionEvalHelper.compareResults(val1, val2) false ``` In fact, `val1` and `val2` are but with different precisions, we should tolerant this case by comparing with percentage range, eg.,expected is within range [target - target * tolerance_percentage, target + target * tolerance_percentage]. After: ``` val1 = 1.6358558070241E306 val2 = 1.6358558070240974E306 ExpressionEvalHelper.compareResults(val1, val2) true ``` ## How was this patch tested? Exsiting testcases. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15059 from jiangxb1987/deq.
* [SPARK-17541][SQL] fix some DDL bugs about table management when same-name ↵Wenchen Fan2016-09-1810-46/+170
| | | | | | | | | | | | | | | | | | | | | | temp view exists ## What changes were proposed in this pull request? In `SessionCatalog`, we have several operations(`tableExists`, `dropTable`, `loopupRelation`, etc) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp view only or metastore table/view only. These bugs are: 1. `CREATE TABLE USING` will fail if a same-name temp view exists 2. `Catalog.dropTempView`will un-cache and drop metastore table if a same-name table exists 3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists. These bug fixes are pulled out from https://github.com/apache/spark/pull/14962 and targets both master and 2.0 branch ## How was this patch tested? new regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #15099 from cloud-fan/fix-view.
* [SPARK-17518][SQL] Block Users to Specify the Internal Data Source Provider Hivegatorsmile2016-09-186-3/+67
| | | | | | | | | | | | ### What changes were proposed in this pull request? In Spark 2.1, we introduced a new internal provider `hive` for telling Hive serde tables from data source tables. This PR is to block users to specify this in `DataFrameWriter` and SQL APIs. ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #15073 from gatorsmile/formatHive.
* [SPARK-17491] Close serialization stream to fix wrong answer bug in ↵Josh Rosen2016-09-178-44/+344
| | | | | | | | | | | | | | | | | | | | | | putIteratorAsBytes() ## What changes were proposed in this pull request? `MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read. This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer). The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`. ## How was this patch tested? The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug. In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component. Author: Josh Rosen <joshrosen@databricks.com> Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.
* [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size ↵hyukjinkwon2016-09-179-33/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | which is O(n) ## What changes were proposed in this pull request? This PR fixes all the instances which was fixed in the previous PR. To make sure, I manually debugged and also checked the Scala source. `length` in [LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57) is O(n). Also, `size` calls `length` via [SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106). For debugging, I have created these as below: ```scala ArrayBuffer(1, 2, 3) Array(1, 2, 3) List(1, 2, 3) Seq(1, 2, 3) ``` and then called `size` and `length` for each to debug. ## How was this patch tested? I ran the bash as below on Mac ```bash find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main" find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main" ``` and then checked each. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15093 from HyukjinKwon/SPARK-17480-followup.
* [SPARK-17575][DOCS] Remove extra table tags in configuration documentsandy2016-09-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove extra table tags in configurations document. ## How was this patch tested? Run all test cases and generate document. Before with extra tag its look like below ![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png) ![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png) After removing tags its looks like below ![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png) ![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png) Author: sandy <phalodi@gmail.com> Closes #15130 from phalodi/SPARK-17575.
* [SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joinsDavid Navas2016-09-173-12/+52
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method). Use this method to clear the subset of the BitSet which needs to be used during merge joins. ## How was this patch tested? dev/run-tests, as well as performance tests on skewed data as described in jira. I expect there to be a small local performance hit using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (additional read on the last long). This is expected to be de-minimis and was not specifically tested. Author: David Navas <davidn@clearstorydata.com> Closes #15084 from davidnavas/bitSet.
* [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects ↵William Benton2016-09-175-24/+83
| | | | | | | | | | | | | | | | the best match when invoked with a vector ## What changes were proposed in this pull request? This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector. ## How was this patch tested? I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected. Author: William Benton <willb@redhat.com> Closes #15105 from willb/fix/findSynonyms.
* [SPARK-17567][DOCS] Use valid url to Spark RDD paperXin Ren2016-09-171-1/+1
| | | | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-17567 ## What changes were proposed in this pull request? Documentation (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) contains broken link to Spark paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it elsewhere (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and I hope it is the same one. It should be uploaded to and linked from some Apache controlled storage, so it won't break again. ## How was this patch tested? Tested manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #15121 from keypointt/SPARK-17567.
* Correct fetchsize property name in docsDaniel Darabos2016-09-172-4/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Replace `fetchSize` with `fetchsize` in the docs. ## How was this patch tested? I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #14975 from darabos/patch-3.
* [SPARK-17549][SQL] Only collect table size stat in driver for cached relation.Marcelo Vanzin2016-09-163-24/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | The existing code caches all stats for all columns for each partition in the driver; for a large relation, this causes extreme memory usage, which leads to gc hell and application failures. It seems that only the size in bytes of the data is actually used in the driver, so instead just colllect that. In executors, the full stats are still kept, but that's not a big problem; we expect the data to be distributed and thus not really incur in too much memory pressure in each individual executor. There are also potential improvements on the executor side, since the data being stored currently is very wasteful (e.g. storing boxed types vs. primitive types for stats). But that's a separate issue. On a mildly related change, I'm also adding code to catch exceptions in the code generator since Janino was breaking with the test data I tried this patch on. Tested with unit tests and by doing a count a very wide table (20k columns) with many partitions. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #15112 from vanzin/SPARK-17549.
* [SPARK-17561][DOCS] DataFrameWriter documentation formatting problemsSean Owen2016-09-163-29/+53
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix `<ul> / <li>` problems in SQL scaladoc. ## How was this patch tested? Scaladoc build and manual verification of generated HTML. Author: Sean Owen <sowen@cloudera.com> Closes #15117 from srowen/SPARK-17561.
* [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3Reynold Xin2016-09-162-16/+16
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes. ## How was this patch tested? The change should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #15115 from rxin/SPARK-17558.
* [SPARK-17426][SQL] Refactor `TreeNode.toJSON` to avoid OOM when converting ↵Sean Zhong2016-09-163-315/+333
| | | | | | | | | | | | | | | | | | | | | | | unknown fields to JSON ## What changes were proposed in this pull request? This PR is a follow up of SPARK-17356. Current implementation of `TreeNode.toJSON` recursively converts all fields of TreeNode to JSON, even if the field is of type `Seq` or type Map. This may trigger out of memory exception in cases like: 1. the Seq or Map can be very big. Converting them to JSON may take huge memory, which may trigger out of memory error. 2. Some user space input may also be propagated to the Plan. The user space input can be of arbitrary type, and may also be self-referencing. Trying to print user space input to JSON may trigger out of memory error or stack overflow error. For a code example, please check the Jira description of SPARK-17426. In this PR, we refactor the `TreeNode.toJSON` so that we only convert a field to JSON string if the field is a safe type. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #14990 from clockfly/json_oom2.
* [SPARK-17534][TESTS] Increase timeouts for DirectKafkaStreamSuite testsAdam Roberts2016-09-161-4/+4
| | | | | | | | | | | | | | | | **## What changes were proposed in this pull request?** There are two tests in this suite that are particularly flaky on the following hardware: 2x Intel(R) Xeon(R) CPU E5-2697 v2 2.70GHz and 16 GB of RAM, 1 TB HDD This simple PR increases the timeout times and batch duration so they can reliably pass **## How was this patch tested?** Existing unit tests with the two core box where I was seeing the failures often Author: Adam Roberts <aroberts@uk.ibm.com> Closes #15094 from a-roberts/patch-6.
* [SPARK-17543] Missing log4j config file for tests in common/network-…Jagadeesan2016-09-161-0/+24
| | | | | | | | | | | | ## What changes were proposed in this pull request? The Maven module `common/network-shuffle` does not have a log4j configuration file for its test cases. So, added `log4j.properties` in the directory `src/test/resources`. …shuffle] Author: Jagadeesan <as2@us.ibm.com> Closes #15108 from jagadeesanas2/SPARK-17543.
* [SPARK-17458][SQL] Alias specified for aggregates in a pivot are not honoredAndrew Ray2016-09-152-1/+20
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This change preserves aliases that are given for pivot aggregations ## How was this patch tested? New unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #15111 from aray/SPARK-17458.
* [SPARK-17484] Prevent invalid block locations from being reported after ↵Josh Rosen2016-09-152-8/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | put() exceptions ## What changes were proposed in this pull request? If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability then incomplete cleanup logic in a `finally` block would never send a second block status method to inform the master of the block's unavailability. This, in turn, leads to fetch failures and used to be capable of causing complete job failures before #15037 was fixed. This patch addresses this issue via multiple small changes: - The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was _all_ that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup). - When a BlockManager receives a request for a block that it does not have it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present. - Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls. This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix. For more background on this issue, including example logs from a real production failure, see [SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484). ## How was this patch tested? Two new regression tests in BlockManagerSuite. Author: Josh Rosen <joshrosen@databricks.com> Closes #15085 from JoshRosen/SPARK-17484.
* [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a ↵Sean Zhong2016-09-153-9/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | decimal number token when parsing SQL string ## What changes were proposed in this pull request? The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, table identifier `default.123_table` is wrongly tokenized as ``` default // Matches lexer rule IDENTIFIER .123 // Matches lexer rule DECIMAL_VALUE _TABLE // Matches lexer rule IDENTIFIER ``` The correct tokenization for `default.123_table` should be: ``` default // Matches lexer rule IDENTIFIER, . // Matches a single dot 123_TABLE // Matches lexer rule IDENTIFIER ``` This PR fix the Antlr grammar so that it can tokenize fully qualified identifier correctly: 1. Fully qualified table name can be parsed correctly. For example, `select * from database.123_suffix`. 2. Fully qualified column name can be parsed correctly, for example `select a.123_suffix from a`. ### Before change #### Case 1: Failed to parse fully qualified column name ``` scala> spark.sql("select a.123_column from a").show org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '.123' expecting {<EOF>, ... , IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8) == SQL == select a.123_column from a --------^^^ ``` #### Case 2: Failed to parse fully qualified table name ``` scala> spark.sql("select * from default.123_table") org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '.123' expecting {<EOF>, ... IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21) == SQL == select * from default.123_table ---------------------^^^ ``` ### After Change #### Case 1: fully qualified column name, no ParseException thrown ``` scala> spark.sql("select a.123_column from a").show ``` #### Case 2: fully qualified table name, no ParseException thrown ``` scala> spark.sql("select * from default.123_table") ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #15006 from clockfly/SPARK-17364.
* [SPARK-17429][SQL] use ImplicitCastInputTypes with function Length岑玉海2016-09-152-5/+7
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? select length(11); select length(2.0); these sql will return errors, but hive is ok. this PR will support casting input types implicitly for function length the correct result is: select length(11) return 2 select length(2.0) return 3 Author: 岑玉海 <261810726@qq.com> Author: cenyuhai <cenyuhai@didichuxing.com> Closes #15014 from cenyuhai/SPARK-17429.
* [SPARK-17114][SQL] Fix aggregates grouped by literals with empty inputHerman van Hovell2016-09-154-3/+86
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes an issue with aggregates that have an empty input, and use a literals as their grouping keys. These aggregates are currently interpreted as aggregates **without** grouping keys, this triggers the ungrouped code path (which aways returns a single row). This PR fixes the `RemoveLiteralFromGroupExpressions` optimizer rule, which changes the semantics of the Aggregate by eliminating all literal grouping keys. ## How was this patch tested? Added tests to `SQLQueryTestSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15101 from hvanhovell/SPARK-17114-3.
* [SPARK-17547] Ensure temp shuffle data file is cleaned up after errorJosh Rosen2016-09-154-49/+73
| | | | | | | | | | SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <joshrosen@databricks.com> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer.
* [SPARK-17379][BUILD] Upgrade netty-all to 4.0.41 final for bug fixesAdam Roberts2016-09-157-6/+11
| | | | | | | | | | | | ## What changes were proposed in this pull request? Upgrade netty-all to latest in the 4.0.x line which is 4.0.41, mentions several bug fixes and performance improvements we may find useful, see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. Initially tried to use 4.1.5 but noticed it's not backwards compatible. ## How was this patch tested? Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z architectures Author: Adam Roberts <aroberts@uk.ibm.com> Closes #14961 from a-roberts/netty.
* [SPARK-17451][CORE] CoarseGrainedExecutorBackend should inform driver before ↵Tejas Patil2016-09-152-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | self-kill ## What changes were proposed in this pull request? Jira : https://issues.apache.org/jira/browse/SPARK-17451 `CoarseGrainedExecutorBackend` in some failure cases exits the JVM. While this does not have any issue, from the driver UI there is no specific reason captured for this. In this PR, I am adding functionality to `exitExecutor` to notify driver that the executor is exiting. ## How was this patch tested? Ran the change over a test env and took down shuffle service before the executor could register to it. In the driver logs, where the job failure reason is mentioned (ie. `Job aborted due to stage ...` it gives the correct reason: Before: `ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.` After: `ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Unable to create executor due to java.util.concurrent.TimeoutException: Timeout waiting for task.` Author: Tejas Patil <tejasp@fb.com> Closes #15013 from tejasapatil/SPARK-17451_inform_driver.
* [SPARK-17406][BUILD][HOTFIX] MiMa excludes fixSean Owen2016-09-151-12/+17
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Following https://github.com/apache/spark/pull/14969 for some reason the MiMa excludes weren't complete, but still passed the PR builder. This adds 3 more excludes from https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.2/1749/consoleFull It also moves the excludes to their own Seq in the build, as they probably should have been. Even though this is merged to 2.1.x only / master, I left the exclude in for 2.0.x in case we back port. It's a private API so is always a false positive. ## How was this patch tested? Jenkins build Author: Sean Owen <sowen@cloudera.com> Closes #15110 from srowen/SPARK-17406.2.
* [SPARK-17536][SQL] Minor performance improvement to JDBC batch insertsJohn Muller2016-09-151-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Optimize a while loop during batch inserts ## How was this patch tested? Unit tests were done, specifically "mvn test" for sql Author: John Muller <jmuller@us.imshealth.com> Closes #15098 from blue666man/SPARK-17536.
* [SPARK-17406][WEB UI] limit timeline executor eventscenyuhai2016-09-158-148/+162
| | | | | | | | | ## What changes were proposed in this pull request? The job page will be too slow to open when there are thousands of executor events(added or removed). I found that in ExecutorsTab file, executorIdToData will not remove elements, it will increase all the time.Before this pr, it looks like [timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png). After this pr, it looks like [timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png)(we can set how many executor events will be displayed) Author: cenyuhai <cenyuhai@didichuxing.com> Closes #14969 from cenyuhai/SPARK-17406.
* [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())codlife2016-09-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? when i use sc.makeRDD below ``` val data3 = sc.makeRDD(Seq()) println(data3.partitions.length) ``` I got an error: Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required We can fix this bug just modify the last line ,do a check of seq.size ``` def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope { assertNotStopped() val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, defaultParallelism), indexToPrefs) } ``` ## How was this patch tested? manual tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: codlife <1004910847@qq.com> Author: codlife <wangjianfei15@otcaix.iscas.ac.cn> Closes #15077 from codlife/master.
* [SPARK-17524][TESTS] Use specified spark.buffer.pageSizeAdam Roberts2016-09-151-2/+4
| | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR has the appendRowUntilExceedingPageSize test in RowBasedKeyValueBatchSuite use whatever spark.buffer.pageSize value a user has specified to prevent a test failure for anyone testing Apache Spark on a box with a reduced page size. The test is currently hardcoded to use the default page size which is 64 MB so this minor PR is a test improvement ## How was this patch tested? Existing unit tests with 1 MB page size and with 64 MB (the default) page size Author: Adam Roberts <aroberts@uk.ibm.com> Closes #15079 from a-roberts/patch-5.
* [SPARK-17507][ML][MLLIB] check weight vector size in ANNWeichenXu2016-09-151-6/+4
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? as the TODO described, check weight vector size and if wrong throw exception. ## How was this patch tested? existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15060 from WeichenXu123/check_input_weight_size_of_ann.
* [SPARK-17440][SPARK-17441] Fixed Multiple Bugs in ALTER TABLEgatorsmile2016-09-154-59/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? For the following `ALTER TABLE` DDL, we should issue an exception when the target table is a `VIEW`: ```SQL ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart' ALTER TABLE viewName SET SERDE 'whatever' ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y') ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y') ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8') ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2') ALTER TABLE viewName RECOVER PARTITIONS ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', b='p') ``` In addition, `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just like the other `ALTER PARTITION` commands. We should issue an exception instead. ### How was this patch tested? Added a few test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #15004 from gatorsmile/altertable.
* [SPARK-17465][SPARK CORE] Inappropriate memory management in ↵Xing SHI2016-09-141-3/+3
| | | | | | | | | | | | | | `org.apache.spark.storage.MemoryStore` may lead to memory leak The expression like `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in method `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be called after release memory operation, whatever `memoryToRelease` is > 0 or not. If the memory of a task has been set to 0 when calling a `releaseUnrollMemoryForThisTask` or a `releasePendingUnrollMemoryForThisTask` method, the key in the memory map corresponding to that task will never be removed from the hash map. See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465). Author: Xing SHI <shi-kou@indetail.co.jp> Closes #15022 from saturday-shi/SPARK-17465.
* [SPARK-17472] [PYSPARK] Better error message for serialization failures of ↵Eric Liang2016-09-142-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | large objects in Python ## What changes were proposed in this pull request? For large objects, pickle does not raise useful error messages. However, we can wrap them to be slightly more user friendly: Example 1: ``` def run(): import numpy.random as nr b = nr.bytes(8 * 1000000000) sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count() run() ``` Before: ``` error: 'i' format requires -2147483648 <= number <= 2147483647 ``` After: ``` pickle.PicklingError: Object too large to serialize: 'i' format requires -2147483648 <= number <= 2147483647 ``` Example 2: ``` def run(): import numpy.random as nr b = sc.broadcast(nr.bytes(8 * 1000000000)) sc.parallelize(range(1000), 1000).map(lambda x: len(b.value)).count() run() ``` Before: ``` SystemError: error return without exception set ``` After: ``` cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set ``` ## How was this patch tested? Manually tried out these cases cc davies Author: Eric Liang <ekl@databricks.com> Closes #15026 from ericl/spark-17472.
* [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value ↵Shixiong Zhu2016-09-145-32/+54
| | | | | | | | | | | | | | | | can be read thread-safely ## What changes were proposed in this pull request? Make CollectionAccumulator and SetAccumulator's value can be read thread-safely to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463). ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15063 from zsxwing/SPARK-17463.
* [SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as ↵Kishor Patil2016-09-142-29/+52
| | | | | | | | | | | | | | | | | | | | | | | Failed ## What changes were proposed in this pull request? Due to race conditions, the ` assert(numExecutorsRunning <= targetNumExecutors)` can fail causing `AssertionError`. So removed the assertion, instead moved the conditional check before launching new container: ``` java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` ## How was this patch tested? This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled to validate the failing job succeeds, without any such exception. Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #15069 from kishorvpatil/SPARK-17511.
* [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BYXin Wu2016-09-1446-80/+639
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, ORDER BY clause returns nulls value according to sorting order (ASC|DESC), considering null value is always smaller than non-null values. However, SQL2003 standard support NULLS FIRST or NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of sorting order (ASC|DESC). This PR is to support this new feature. ## How was this patch tested? New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Xin Wu <xinwu@us.ibm.com> Closes #14842 from xwu0226/SPARK-10747.
* [MINOR][SQL] Add missing functions for some options in SQLConf and use them ↵hyukjinkwon2016-09-1512-31/+49
| | | | | | | | | | | | | | | | | | | | where applicable ## What changes were proposed in this pull request? I first thought they are missing because they are kind of hidden options but it seems they are just missing. For example, `spark.sql.parquet.mergeSchema` is documented in [sql-programming-guide.md](https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md) but this function is missing whereas many options such as `spark.sql.join.preferSortMergeJoin` are not documented but have its own function individually. So, this PR suggests making them consistent by adding the missing functions for some options in `SQLConf` and use them where applicable, in order to make them more readable. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14678 from HyukjinKwon/sqlconf-cleanup.
* [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same ↵Josh Rosen2016-09-144-18/+26
| | | | | | | | | | | | | | | | | | | | in Python ## What changes were proposed in this pull request? In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing. The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.). In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations. ## How was this patch tested? Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries. Author: Josh Rosen <joshrosen@databricks.com> Closes #15068 from JoshRosen/pyspark-collect-limit.
* [SPARK-17409][SQL] Do Not Optimize Query in CTAS More Than Oncegatorsmile2016-09-1413-42/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? As explained in https://github.com/apache/spark/pull/14797: >Some analyzer rules have assumptions on logical plans, optimizer may break these assumption, we should not pass an optimized query plan into QueryExecution (will be analyzed again), otherwise we may some weird bugs. For example, we have a rule for decimal calculation to promote the precision before binary operations, use PromotePrecision as placeholder to indicate that this rule should not apply twice. But a Optimizer rule will remove this placeholder, that break the assumption, then the rule applied twice, cause wrong result. We should not optimize the query in CTAS more than once. For example, ```Scala spark.range(99, 101).createOrReplaceTempView("tab1") val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1" sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt") checkAnswer(spark.table("tab2"), sql(sqlStmt)) ``` Before this PR, the results do not match ``` == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![100,100.000000000000000000] [100,null] [99,99.000000000000000000] [99,99.000000000000000000] ``` After this PR, the results match. ``` +---+----------------------+ |id |num | +---+----------------------+ |99 |99.000000000000000000 | |100|100.000000000000000000| +---+----------------------+ ``` In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`. ### How was this patch tested? Added a test Author: gatorsmile <gatorsmile@gmail.com> Closes #15048 from gatorsmile/ctasOptimized.
* [SPARK-17445][DOCS] Reference an ASF page as the main place to find ↵Sean Owen2016-09-149-19/+18
| | | | | | | | | | | | | | | | | | third-party packages ## What changes were proposed in this pull request? Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15075 from srowen/SPARK-17445.
* [SPARK-17480][SQL] Improve performance by removing or caching List.length ↵Ergin Seyfe2016-09-143-10/+9
| | | | | | | | | | | | | | | | | | | | which is O(n) ## What changes were proposed in this pull request? Scala's List.length method is O(N) and it makes the gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by writing it in Scala way. https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36 As suggested. Extended the fix to HiveInspectors and AggregationIterator classes as well. ## How was this patch tested? Profiled a Spark job and found that CompressibleColumnBuilder is using 39% of the CPU. Out of this 39% CompressibleColumnBuilder->gatherCompressibilityStats is using 23% of it. 6.24% of the CPU is spend on List.length which is called inside gatherCompressibilityStats. After this change we started to save 6.24% of the CPU. Author: Ergin Seyfe <eseyfe@fb.com> Closes #15032 from seyfe/gatherCompressibilityStats.
* [CORE][DOC] remove redundant commentwm624@hotmail.com2016-09-141-9/+9
| | | | | | | | | | | ## What changes were proposed in this pull request? In the comment, there is redundant `the estimated`. This PR simply remove the redundant comment and adjusts format. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15091 from wangmiao1981/comment.
* [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API ↵Sami Jaktholm2016-09-141-8/+0
| | | | | | | | | | | | | | | | as it was removed from the Scala API prior to Spark 2.0.0 ## What changes were proposed in this pull request? This pull request removes the SparkContext.clearFiles() method from the PySpark API as the method was removed from the Scala API in 8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to an exception as PySpark tries to call the non-existent method on the JVM side. ## How was this patch tested? Existing tests (though none of them tested this particular method). Author: Sami Jaktholm <sjakthol@outlook.com> Closes #15081 from sjakthol/pyspark-sc-clearfiles.
* [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…Jagadeesan2016-09-142-1/+3
| | | | | | | | | | | | ## What changes were proposed in this pull request? The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document. … network timeout] Author: Jagadeesan <as2@us.ibm.com> Closes #15042 from jagadeesanas2/SPARK-17449.