* [SPARK-12600][SQL] Remove deprecated methods in Spark SQL (Reynold Xin, 2016-01-04, 28 files, -1295/+174)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10559 from rxin/remove-deprecated-sql.
* [SPARK-12509][SQL] Fixed error messages for DataFrame correlation and covariance (Narine Kokhlikyan, 2016-01-04, 1 file, -6/+7)
  Currently, when we call corr or cov on a DataFrame with invalid input, we see these error messages for both corr and cov:
  - "Currently cov supports calculating the covariance between two columns"
  - "Covariance calculation for columns with dataType "[DataType Name]" not supported."
  I've fixed this issue by passing the function name as an argument. We could also do the input checks separately for each function; I avoided doing that because of code duplication. Thanks!
  Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
  Closes #10458 from NarineK/sparksqlstatsmessages.
* [SPARK-12589][SQL] Fix UnsafeRowParquetRecordReader to properly set the row length (Nong Li, 2016-01-04, 3 files, -0/+37)
  The reader was previously not setting the row length, meaning it was wrong if there were variable-length columns. This problem does not usually manifest, since the value in the column is correct and projecting the row fixes the issue.
  Author: Nong Li <nong@databricks.com>
  Closes #10576 from nongli/spark-12589.
* [SPARK-12541][SQL] support cube/rollup as function (Davies Liu, 2016-01-04, 8 files, -48/+87)
  This PR enables cube/rollup as functions, so they can be used like this:

  ```
  select a, b, sum(c) from t group by rollup(a, b)
  ```

  Author: Davies Liu <davies@databricks.com>
  Closes #10522 from davies/rollup.
* [SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction (Yanbo Liang, 2016-01-04, 5 files, -4/+92)
  DecisionTreeRegressor will provide variance of prediction as a Double column.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8866 from yanboliang/spark-9622.
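  A minimal usage sketch (assuming the feature is exposed through a varianceCol param, per the ML Params convention; check the PR for the exact name):

  ```scala
  import org.apache.spark.ml.regression.DecisionTreeRegressor

  val dtr = new DecisionTreeRegressor()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setVarianceCol("variance") // adds a Double column with the variance of each prediction

  // val model = dtr.fit(trainingDF); model.transform(testDF) then carries a "variance" column
  ```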
* [SPARK-11259][ML] Params.validateParams() should be called automatically (Yanbo Liang, 2016-01-04, 30 files, -1/+63)
  See JIRA: https://issues.apache.org/jira/browse/SPARK-11259
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9224 from yanboliang/spark-11259.
* [SPARK-12421][SQL] Prevent Internal/External row from exposing state (Herman van Hovell, 2016-01-04, 2 files, -4/+34)
  It is currently possible to change the values of the supposedly immutable `GenericRow` and `GenericInternalRow` classes. This is caused by the fact that Scala's ArrayOps `toArray` (called on the result of `toSeq`) returns the backing array instead of a copy. This PR fixes this problem.
  This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.
  cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10553 from hvanhovell/SPARK-12421.
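  A small sketch of the underlying Scala 2.11-era behavior (plain Scala, not the Spark patch itself):

  ```scala
  val values = Array("a", "b")
  val seq: Seq[String] = values.toSeq // a WrappedArray sharing the backing array, no copy
  val leaked = seq.toArray            // same element type, so the backing array itself is returned
  leaked(0) = "mutated"
  println(values.mkString(","))       // prints "mutated,b": the "immutable" row's state changed
  ```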
* [DOC] Adjust coverage for partitionBy() (tedyu, 2016-01-04, 1 file, -1/+1)
  This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark
  Michael suggested fixing the doc. Please review.
  Author: tedyu <yuzhihong@gmail.com>
  Closes #10499 from ted-yu/master.
* [SPARK-12512][SQL] support column name with dot in withColumn() (Xiu Guo, 2016-01-04, 2 files, -12/+27)
  Author: Xiu Guo <xguo27@gmail.com>
  Closes #10500 from xguo27/SPARK-12512.
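  A hedged sketch of the usage this targets (assuming a DataFrame df; the exact semantics are in the PR):

  ```scala
  import org.apache.spark.sql.functions.{col, lit}

  // A column whose literal name contains a dot, rather than a nested field reference:
  val df2 = df.withColumn("a.b", lit(1))
  df2.select(col("`a.b`")) // backticks quote the literal name on lookup
  ```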
* [SPARK-12608][STREAMING] Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result (Shixiong Zhu, 2016-01-04, 1 file, -6/+1)
  Before #9264, submitJob would create a separate thread to wait for the job result. `submitJobThreadPool` was a workaround in `ReceiverTracker` to run these waiting-for-job-result threads. Now that #9264 has been merged to master and resolves this blocking issue, `submitJobThreadPool` can be removed.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10560 from zsxwing/remove-submitJobThreadPool.
* [SPARK-12470][SQL] Fix size reduction calculation (Pete Robbins, 2016-01-04, 1 file, -4/+4)
  Also only allocate the required buffer size.
  Author: Pete Robbins <robbinspg@gmail.com>
  Closes #10421 from robbinspg/master.
* [SPARK-12579][SQL] Force user-specified JDBC driver to take precedence (Josh Rosen, 2016-01-04, 7 files, -50/+34)
  Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection. In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
  This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly). If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
  This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10519 from JoshRosen/jdbc-driver-precedence.
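  A rough sketch of the lookup strategy described above (names and structure are illustrative, not the actual patch):

  ```scala
  import java.sql.{Driver, DriverManager}
  import scala.collection.JavaConverters._

  def resolveDriver(userSpecifiedClass: Option[String], url: String): Driver =
    userSpecifiedClass match {
      case Some(cls) =>
        // Prefer the exact class the user asked for over whatever else claims the subprotocol.
        DriverManager.getDrivers.asScala
          .find(_.getClass.getName == cls)
          .getOrElse(throw new IllegalStateException(s"Driver $cls was not registered"))
      case None =>
        // Otherwise fall back to JDBC's own resolution for this URL.
        DriverManager.getDriver(url)
    }
  ```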
* [SPARK-12486] Worker should kill the executors more forcefully if possible (Nong Li, 2016-01-04, 3 files, -12/+112)
  This patch updates the ExecutorRunner's terminate path to use the new Java 8 API to terminate processes more forcefully if possible. If the executor is unhealthy, it would previously ignore the destroy() call. Presumably, the new Java API was added to handle cases like this.
  We could update the termination path in the future to use OS-specific commands for older Java versions.
  Author: Nong Li <nong@databricks.com>
  Closes #10438 from nongli/spark-12486-executors.
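  A simplified sketch of such a termination path (the timeout value is illustrative):

  ```scala
  import java.util.concurrent.TimeUnit

  def terminate(process: Process): Int = {
    process.destroy()                           // ask nicely first
    if (!process.waitFor(10, TimeUnit.SECONDS)) {
      process.destroyForcibly()                 // Java 8: force-kill if destroy() was ignored
    }
    process.waitFor()                           // reap and return the exit code
  }
  ```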
* [SPARK-12513][STREAMING] SocketReceiver hang in Netcat example (guoxu1231, 2016-01-04, 1 file, -14/+24)
  Explicitly close the client-side socket connection before restarting the socket receiver.
  Author: guoxu1231 <guoxu1231@gmail.com>
  Author: Shawn Guo <guoxu1231@gmail.com>
  Closes #10464 from guoxu1231/SPARK-12513.
* [SPARK-10359][PROJECT-INFRA] Use a more random number in dev/test-dependencies.sh; fix version switching (Josh Rosen, 2016-01-04, 2 files, -5/+15)
  This patch aims to fix another potential source of flakiness in the `dev/test-dependencies.sh` script. pwendell's original patch and my version used `$(date +%s | tail -c6)` to generate a suffix to use when installing temporary Spark versions into the local Maven cache, but this value only changes once per second and thus is highly collision-prone when concurrent builds launch on AMPLab Jenkins. In order to reduce the potential for conflicts, this patch updates the script to call Python's random number generator instead. I also fixed a bug in how we captured the original project version; the bug was causing the exit handler code to fail.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10558 from JoshRosen/build-dep-tests-round-3.
* [SPARK-12612][PROJECT-INFRA] Add missing Hadoop profiles to dev/run-tests-*.py scripts and dev/deps (Josh Rosen, 2016-01-03, 5 files, -2/+394)
  There are a couple of places in the `dev/run-tests-*.py` scripts which deal with Hadoop profiles, but the set of profiles that they handle does not include all Hadoop profiles defined in our POM. Similarly, the `hadoop-2.2` and `hadoop-2.6` profiles were missing from `dev/deps`. This patch updates these scripts to include all four Hadoop profiles defined in our POM.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10565 from JoshRosen/add-missing-hadoop-profiles-in-test-scripts.
* [SPARK-12562][SQL] DataFrame.write.format("text") requires the column name to be called "value" (Xiu Guo, 2016-01-03, 2 files, -6/+7)
  Author: Xiu Guo <xguo27@gmail.com>
  Closes #10515 from xguo27/SPARK-12562.
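  For context, a sketch of the constraint named in the title (illustrative; renaming before the write was the usual workaround):

  ```scala
  // The text data source writes a single string column, historically required to be named "value":
  df.select(df("myText").as("value"))
    .write
    .format("text")
    .save("/tmp/out")
  ```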
* [SPARK-12611][SQL][PYSPARK][TESTS] Fix test_infer_schema_to_local (Holden Karau, 2016-01-03, 1 file, -1/+1)
  Previously (when the PR was first created) not specifying b= explicitly was fine (and treated as a default null); instead, be explicit about b being None in the test.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local.
* [SPARK-12537][SQL] Add option to accept backslash quoting of any character (Cazen, 2016-01-03, 4 files, -2/+30)
  This provides an option controlling whether the JSON parser accepts backslash quoting of any character.
  Author: Cazen <Cazen@korea.com>
  Author: Cazen Lee <cazen.lee@samsung.com>
  Author: Cazen Lee <Cazen@korea.com>
  Author: cazen.lee <cazen.lee@samsung.com>
  Closes #10497 from Cazen/master.
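  A hedged usage sketch, given a SQLContext sqlContext (assuming the option surfaced here is the allowBackslashEscapingAnyCharacter read option; check the PR for the exact name):

  ```scala
  val people = sqlContext.read
    .option("allowBackslashEscapingAnyCharacter", "true") // accept e.g. "\x" instead of failing
    .json("/path/to/input.json")
  ```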
* Update MimaExcludes now that Spark 1.6 is in Maven (Reynold Xin, 2016-01-03, 2 files, -148/+12)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10561 from rxin/update-mima.
* [SPARK-12533][SQL] hiveContext.table() throws the wrong exception (thomastechs, 2016-01-03, 2 files, -4/+4)
  Avoid the "no such table" exception and throw an AnalysisException instead, as described in SPARK-12533.
  Author: thomastechs <thomas.sebastian@tcs.com>
  Closes #10529 from thomastechs/topic-branch.
* [SPARK-12327][SPARKR] fix lintr warnings about commented code (felixcheung, 2016-01-03, 9 files, -11/+88)
  cc shivaram
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #10408 from felixcheung/rcodecomment.
* Revert "Revert "[SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] ↵Reynold Xin2016-01-0234-574/+74
| | | | | | always output UnsafeRow"" This reverts commit 44ee920fd49d35b421ae562ea99bcc8f2b98ced6.
* [SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib (Reynold Xin, 2016-01-02, 2 files, -2/+16)
  callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces the invocations of the deprecated callUDF with it.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10547 from rxin/SPARK-12599.
* [SPARK-12481][CORE][STREAMING][SQL] Remove usage of deprecated Hadoop APIs and reflection that supported 1.x (Sean Owen, 2016-01-02, 46 files, -441/+150)
  Remove use of deprecated Hadoop APIs now that 2.2+ is required.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #10446 from srowen/SPARK-12481.
* [SPARK-10180][SQL] JDBC datasource is not processing EqualNullSafe filter (hyukjinkwon, 2016-01-02, 2 files, -2/+7)
  This PR is a follow-up to https://github.com/apache/spark/pull/8391. The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC datasource. This PR fixes the problem that the comparison can actually return null, causing an error when the result of that comparison is used.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: HyukjinKwon <gurwls223@gmail.com>
  Closes #8743 from HyukjinKwon/SPARK-10180.
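  For reference, a sketch of the semantics involved (the SQL comment shows what a null-safe comparison must reduce to so it never yields NULL; assuming a DataFrame df):

  ```scala
  // Column's null-safe equality operator in the DataFrame API:
  val matched = df.filter(df("a") <=> df("b")) // true/false even when a or b is NULL
  // A null-safe pushdown would translate to something like (illustrative):
  //   WHERE (a = b) OR (a IS NULL AND b IS NULL)
  ```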
* [SPARK-12362][SQL][WIP] Inline Hive Parser (Herman van Hovell, 2016-01-01, 18 files, -73/+5443)
  This PR inlines the Hive SQL parser in Spark SQL. The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by the fact that, for some reason, in some cases the ANTLR-generated code is not included in the compilation phase.
  This PR is a WIP and should not be merged until we have sorted out the build issues.
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Author: Nong Li <nong@databricks.com>
  Author: Nong Li <nongli@gmail.com>
  Closes #10525 from hvanhovell/SPARK-12362.
* Revert "[SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] always ↵Reynold Xin2016-01-0134-74/+574
| | | | | | output UnsafeRow" This reverts commit 0da7bd50ddf0fb9e0e8aeadb9c7fb3edf6f0ee6e.
* [SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] always output UnsafeRow (Davies Liu, 2016-01-01, 34 files, -574/+74)
  It's confusing that some operators output UnsafeRow and some do not, which makes it easy to make mistakes. This PR changes all operators (SparkPlan) to output only UnsafeRow and removes the rule that inserted Unsafe/Safe conversions. For those that can't output UnsafeRow directly, an UnsafeProjection was added to them.
  Closes #10330
  cc JoshRosen rxin
  Author: Davies Liu <davies@databricks.com>
  Closes #10511 from davies/unsafe_row.
* Disable test-dependencies.sh (Reynold Xin, 2016-01-01, 1 file, -2/+3)
* [SPARK-12592][SQL][TEST] Don't mute Spark loggers in TestHive.reset() (Cheng Lian, 2016-01-01, 1 file, -1/+4)
  There's a hack in `TestHive.reset()` that was intended to mute noisy Hive loggers. However, Spark testing loggers were also muted.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10540 from liancheng/spark-12592.dont-mute-spark-loggers.
* [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add a few filters (Liang-Chi Hsieh, 2016-01-01, 2 files, -31/+45)
  This patch refactors the filter pushdown for JDBCRDD and also adds a few filters. The added filters are basically from #10468 with some refactoring. Test cases are from #10468.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10470 from viirya/refactor-jdbc-filter.
* [SPARK-3873][MLLIB] Import order fixes (Marcelo Vanzin, 2015-12-31, 95 files, -169/+160)
  A slight adjustment to the checker configuration was needed; a handful of warnings are still left, but those are because of a bug in the checker that I'll fix separately (before enabling errors for the checker, of course).
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #10535 from vanzin/SPARK-3873-mllib.
* [SPARK-11743][SQL] Move the test for arrayOfUDT (Liang-Chi Hsieh, 2015-12-31, 1 file, -13/+2)
  A follow-up PR for #9712: move the test for arrayOfUDT.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10538 from viirya/move-udt-test.
* [SPARK-10359][PROJECT-INFRA] Multiple fixes to dev/test-dependencies.sh script (Josh Rosen, 2015-12-31, 2 files, -2/+9)
  This patch includes multiple fixes for the `dev/test-dependencies.sh` script (which was introduced in #10461):
  - Use `build/mvn --force` instead of `mvn` in one additional place.
  - Explicitly set a zero exit code on success.
  - Set `LC_ALL=C` to make `sort` results agree across machines (see https://stackoverflow.com/questions/28881/).
  - Set `should_run_build_tests=True` for the `build` module (this somehow got lost).
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10543 from JoshRosen/dep-script-fixes.
* [SPARK-3873][STREAMING] Import order fixes for streaming (Marcelo Vanzin, 2015-12-31, 73 files, -180/+181)
  Also included a few miscellaneous other modules that had very few violations.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #10532 from vanzin/SPARK-3873-streaming.
* [SPARK-12039][SQL] Re-enable HiveSparkSubmitSuite's SPARK-9757 "Persist Parquet relation with decimal column" (Yin Huai, 2015-12-31, 1 file, -1/+1)
  https://issues.apache.org/jira/browse/SPARK-12039
  Since we no longer support Hadoop 1, we can re-enable this test in master.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #10533 from yhuai/SPARK-12039-enable.
* [SPARK-7995][SPARK-6280][CORE] Remove AkkaRpcEnv and remove systemName from setupEndpointRef (Shixiong Zhu, 2015-12-31, 29 files, -1120/+90)
  ### Remove AkkaRpcEnv
  Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring Streaming's actorStream API.
  ### Remove systemName
  There are 2 places using `systemName`:
  * `RpcEnvConfig.name`. Actually, although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name to output the log `Successfully started service *** on port ***`. Since the service name in the log is useful, I kept `RpcEnvConfig.name`.
  * `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires `systemName` in its URI and will refuse a connection if `systemName` does not match. However, `NettyRpcEnv` doesn't use it, so we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.
  ### Remove RpcEnv.uriOf
  `uriOf` exists because Akka uses different URI formats with and without authentication, e.g., `akka.ssl.tcp://...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format, so it's not necessary after removing `AkkaRpcEnv`.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10459 from zsxwing/remove-akka-rpc-env.
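  A sketch of the resulting call shape, given an RpcEnv instance rpcEnv (the host, port, and endpoint name are illustrative):

  ```scala
  import org.apache.spark.rpc.RpcAddress

  // Before: setupEndpointRef(systemName, address, endpointName); Akka needed systemName in its URI.
  // After: NettyRpcEnv needs only the address and the endpoint name.
  val ref = rpcEnv.setupEndpointRef(RpcAddress("driver-host", 7077), "CoarseGrainedScheduler")
  ```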
* [SPARK-12585][SQL] move numFields to constructor of UnsafeRow (Davies Liu, 2015-12-30, 16 files, -137/+86)
  Right now, numFields is passed in by pointTo(), and then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy. It should be part of the constructor of UnsafeRow.
  Author: Davies Liu <davies@databricks.com>
  Closes #10528 from davies/numFields.
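  A sketch of the change in call shape (simplified; see the PR for the actual signatures):

  ```scala
  import org.apache.spark.sql.catalyst.expressions.UnsafeRow

  val bytes = new Array[Byte](64)
  val row = new UnsafeRow(5)        // numFields fixed at construction; bitset width derived once
  row.pointTo(bytes, bytes.length)  // the hot-path pointTo() no longer recomputes it per call
  ```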
* House cleaning: close old pull requests (Reynold Xin, 2015-12-30, 0 files, -0/+0)
  Closes #5400 Closes #5408 Closes #5423 Closes #5668 Closes #6757 Closes #6745 Closes #6613
* Closes #10386 since it was superseded by #10468 (Reynold Xin, 2015-12-30, 0 files, -0/+0)
* House cleaning: close open pull requests created before June 1st, 2015 (Reynold Xin, 2015-12-30, 0 files, -0/+0)
  Closes #5358 Closes #3744 Closes #3677 Closes #3536 Closes #3249 Closes #3221 Closes #2446 Closes #3794 Closes #3815 Closes #3816 Closes #3866 Closes #4286 Closes #5184 Closes #5170 Closes #5142 Closes #5025 Closes #5005 Closes #4897 Closes #4887 Closes #4849 Closes #4632 Closes #4622 Closes #4456 Closes #4449 Closes #4417 Closes #5483 Closes #5325 Closes #6545 Closes #6449 Closes #6433 Closes #6416 Closes #6403 Closes #6386 Closes #6263 Closes #6245 Closes #6213 Closes #6155 Closes #6133 Closes #6018 Closes #5978 Closes #5869 Closes #5852 Closes #5848 Closes #5754 Closes #5598 Closes #5503 Closes #4380
* [SPARK-12561] Remove JobLogger in Spark 2.0 (Reynold Xin, 2015-12-30, 1 file, -277/+0)
  It was research code and has been deprecated since 1.0.0. No one really uses it since they can just use event logging.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10530 from rxin/SPARK-12561.
* [SPARK-3873][GRAPHX] Import order fixes (Marcelo Vanzin, 2015-12-30, 23 files, -50/+33)
  There's one warning left, caused by a bug in the checker.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #10537 from vanzin/SPARK-3873-graphx.
* [SPARK-3873][YARN] Fix import ordering (Marcelo Vanzin, 2015-12-30, 12 files, -26/+23)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #10536 from vanzin/SPARK-3873-yarn.
* [SPARK-12588] Remove HttpBroadcast in Spark 2.0 (Reynold Xin, 2015-12-30, 12 files, -491/+22)
  We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #10531 from rxin/SPARK-12588.
* [SPARK-8641][SPARK-12455][SQL] Native Spark Window functions - Follow-up (docs & tests) (Herman van Hovell, 2015-12-30, 3 files, -3/+162)
  This PR is a follow-up for PR https://github.com/apache/spark/pull/9819. It adds documentation for the window functions and a couple of NULL tests.
  The documentation was largely based on the documentation in (the source of) Hive and Presto:
  * https://prestodb.io/docs/current/functions/window.html
  * https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
  I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts?
  cc yhuai
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10402 from hvanhovell/SPARK-8641-docs.
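  A small usage sketch of the window-function API being documented (assuming a DataFrame df with name, dept, and salary columns):

  ```scala
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  val w = Window.partitionBy("dept").orderBy(col("salary").desc)
  df.select(col("name"), rank().over(w).as("rank"), avg("salary").over(w).as("avgSalary"))
  ```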
* [SPARK-12399] Display correct error message when accessing REST API with an unknown app Id (Carson Wang, 2015-12-30, 1 file, -2/+14)
  I got an exception when accessing the below REST API with an unknown application Id.
  `http://<server-url>:18080/api/v1/applications/xxx/jobs`
  Instead of an exception, I expect an error message "no such app: xxx", similar to the error message I get when accessing `/api/v1/applications/xxx`.

  ```
  org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
  	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
  	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
  	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
  	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
  	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
  	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
  ```

  Author: Carson Wang <carson.wang@intel.com>
  Closes #10352 from carsonwang/unknownAppFix.
* [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC (Takeshi YAMAMURO, 2015-12-30, 2 files, -2/+35)
  This is reworked from #10386 and adds more tests and LIKE push-down support.
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #10468 from maropu/SupportMorePushdownInJdbc.
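  A sketch of how such source filters might compile to a WHERE-clause fragment (illustrative, not the patch itself; value quoting and escaping are omitted for brevity):

  ```scala
  import org.apache.spark.sql.sources._

  def compileFilter(f: Filter): Option[String] = f match {
    case In(attr, values)        => Some(s"$attr IN (${values.mkString(", ")})")
    case StringContains(attr, v) => Some(s"$attr LIKE '%$v%'")
    case And(l, r) => for (a <- compileFilter(l); b <- compileFilter(r)) yield s"($a) AND ($b)"
    case Or(l, r)  => for (a <- compileFilter(l); b <- compileFilter(r)) yield s"($a) OR ($b)"
    case _         => None // unsupported filters are evaluated on the Spark side instead
  }
  ```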
* [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests (Josh Rosen, 2015-12-30, 10 files, -120/+512)
  This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether a pull request which modifies the build has introduced new dependencies or changed transitive dependencies in a way that affects the final classpath.
  This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs.
  This patch is based on pwendell's work in #8531. Closes #8531.
  Author: Josh Rosen <joshrosen@databricks.com>
  Author: Patrick Wendell <patrick@databricks.com>
  Closes #10461 from JoshRosen/SPARK-10359.