...
* [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc (Jun Kim, 2016-10-15, 1 file, -1/+1)

## What changes were proposed in this pull request?

### Before:
```scala
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value").
  .getOrCreate()
```

### After:
```scala
SparkSession.builder()
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
```

There was one unexpected dot!

Author: Jun Kim <i2r.jun@gmail.com>
Closes #15498 from tae-jun/SPARK-17953.

* [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query (Michael Allman, 2016-10-14, 37 files, -368/+914)

(This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)

## What changes were proposed in this pull request?

In a new Spark session, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference.

If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild.

In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space.

This PR proposes an alternative approach. Basically, it makes four changes:

1. It adds a new method, `listPartitionsByFilter`, to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition-pruning predicates.
2. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition-pruning predicates.
3. It removes partition loading and caching from `HiveMetastoreCatalog`.
4. It adds a new Catalyst optimizer rule, `PruneFileSourcePartitions`, which applies a plan's partition-pruning predicates to prune out unnecessary partition files from a `HadoopFsRelation`'s underlying file catalog.

The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The optimizer prunes away unnecessary table partitions by sending the partition-pruning predicates to the relation's `TableFileCatalog`. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.

As a bonus, performing partition pruning during optimization leads to a more accurate relation size estimate. This, along with c481bdf, can lead to automatic, safe application of the broadcast optimization in a join where it might previously have been omitted.

## Open Issues

1. This PR omits partition metadata caching. I can add this once the overall strategy for the cold path is established, perhaps in a future PR.
2. This PR removes and omits partitioned Hive table schema reconciliation. As a result, it fails to find Parquet schema columns with upper case letters because of the Hive metastore's case-insensitivity. This issue may be fixed by #14750, but that PR appears to have stalled. ericl has contributed to this PR a workaround for Parquet wherein schema reconciliation occurs at query execution time instead of planning. Whether ORC requires a similar patch is an open issue.
3. This PR omits an implementation of `listPartitionsByFilter` for the `InMemoryCatalog`.
4. This PR breaks parquet log output redirection during query execution. I can work around this by running `Class.forName("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$")` first thing in a Spark shell session, but I haven't figured out how to fix this properly.

## How was this patch tested?

The current Spark unit tests were run, and some ad-hoc tests were performed to validate that only the necessary partition metadata is loaded.

Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #14690 from mallman/spark-16980-lazy_partition_fetching.

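A minimal sketch of the access pattern this change targets, assuming a Hive table `logs` partitioned by `ds` and an active `spark` session (table and column names are illustrative):

```scala
// With lazy partition fetching, planning this query should only ask the
// metastore for partitions matching ds = '2016-10-14' instead of loading
// metadata for every partition of the table up front.
val pruned = spark.sql("SELECT count(*) FROM logs WHERE ds = '2016-10-14'")
pruned.explain(true)   // the pruning predicate should reach the file catalog
pruned.show()
```
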
* [SPARK-17946][PYSPARK] Python crossJoin API similar to Scala (Srinath Shankar, 2016-10-14, 3 files, -7/+36)

## What changes were proposed in this pull request?

Add a crossJoin function to the DataFrame API similar to that in Scala. Joins with no condition (cartesian products) must be specified with the crossJoin API.

## How was this patch tested?

Added Python tests to ensure that an AnalysisException is raised if a cartesian product is specified without crossJoin(), and that cartesian products can execute if specified via crossJoin().

Author: Srinath Shankar <srinath@databricks.com>
Closes #15493 from srinathshankar/crosspython.

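For reference, a sketch of the Scala API that the new Python `crossJoin` mirrors (`df1` and `df2` are assumed to be DataFrames already in scope):

```scala
// A cartesian product must be requested explicitly:
val product = df1.crossJoin(df2)

// An unconditioned join is treated as an implicit cartesian product and
// fails analysis with an AnalysisException when the plan is executed,
// unless cartesian products are explicitly enabled.
val ambiguous = df1.join(df2)
```
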
* [SPARK-17900][SQL] Graduate a list of Spark SQL APIs to stable (Reynold Xin, 2016-10-14, 9 files, -62/+51)

## What changes were proposed in this pull request?

This patch graduates a list of Spark SQL APIs and marks them stable.

The following are marked stable:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3

The following are kept experimental / evolving:

Data sources:
- CatalystScan (tied to internal logical plans so it is not stable by definition)

Structured streaming:
- all classes (introduced new in 2.0 and will likely change)

Dataset typed operations (introduced in 1.6 and 2.0 and might change, although probability is low)
- all typed methods on Dataset
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession

We should discuss more whether we want to mark Dataset typed operations stable in 2.1.

## How was this patch tested?

N/A - just annotation changes.

Author: Reynold Xin <rxin@databricks.com>
Closes #15469 from rxin/SPARK-17900.

* [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF (Jeff Zhang, 2016-10-14, 5 files, -4/+152)

Currently PySpark can only call the built-in Java UDFs, but cannot call custom Java UDFs. It would be better to allow that. Two benefits:
* Leverage the power of rich third-party Java libraries
* Improve performance. If we use a Python UDF, Python daemons have to be started on the workers, which affects performance.

Author: Jeff Zhang <zjffdu@apache.org>
Closes #9766 from zjffdu/SPARK-11775.

* [SPARK-16063][SQL] Add storageLevel to Dataset (Nick Pentreath, 2016-10-14, 3 files, -16/+68)

[SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.

Updated `DatasetCacheSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>
Closes #13780 from MLnick/ds-storagelevel.
Signed-off-by: Michael Armbrust <michael@databricks.com>

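A sketch of the new accessor; the title calls it `storageLevel`, which is what is assumed below, and `spark` is an active SparkSession:

```scala
import org.apache.spark.storage.StorageLevel

val ds = spark.range(100)
ds.storageLevel                           // StorageLevel.NONE while not cached
ds.persist(StorageLevel.MEMORY_AND_DISK)
ds.storageLevel                           // now MEMORY_AND_DISK
ds.unpersist()
```
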
* [SPARK-17863][SQL] should not add column into Distinct (Davies Liu, 2016-10-14, 2 files, -0/+26)

## What changes were proposed in this pull request?

We were resolving an attribute in a sort by pulling a column from the grandchild up into the child, but that is wrong when the child is a Distinct, because the added column changes the behavior of the Distinct. We should not do that.

## How was this patch tested?

Added a regression test.

Author: Davies Liu <davies@databricks.com>
Closes #15489 from davies/order_distinct.

* Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables" (Yin Huai, 2016-10-14, 3 files, -60/+9)

This reverts commit 7ab86244e30ca81eb4fa40ea77b4c2b8881cbab2.

* [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables (Dilip Biswal, 2016-10-14, 3 files, -9/+60)

## What changes were proposed in this pull request?

Make sure hive.default.fileformat is used when creating the storage format metadata.

Output:
```sql
scala> spark.sql("SET hive.default.fileformat=orc")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("CREATE TABLE tmp_default(id INT)")
res2: org.apache.spark.sql.DataFrame = []
```

Before:
```sql
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
```

After:
```sql
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
```

## How was this patch tested?

Added new tests to HiveDDLCommandSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #15190 from dilipbiswal/orc.

* [SPARK-17941][ML][TEST] Logistic regression tests should use sample weights. (sethah, 2016-10-14, 1 file, -745/+748)

## What changes were proposed in this pull request?

The sample weight testing for logistic regression is not robust. The logistic regression suite already has many test cases comparing results to R glmnet. Since both libraries support sample weights, we should use sample weights in the tests to increase coverage for sample weighting. This patch doesn't really add any code and makes the testing more complete.

Also fixed some errors with the R code that was referenced in the test suite. Changed `standardization=T` to `standardize=T` since the former is invalid.

## How was this patch tested?

Existing unit tests are modified. No non-test code is touched.

Author: sethah <seth.hendrickson16@gmail.com>
Closes #15488 from sethah/logreg_weight_tests.

* [TEST] Ignore flaky test in StreamingQueryListenerSuite (Tathagata Das, 2016-10-14, 1 file, -1/+1)

## What changes were proposed in this pull request?

Ignoring the flaky test introduced in #15307:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1736/testReport/junit/org.apache.spark.sql.streaming/StreamingQueryListenerSuite/single_listener__check_trigger_statuses/

Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #15491 from tdas/metrics-flaky-test.

* Typo: form -> from (Andrew Ash, 2016-10-14, 1 file, -1/+1)

## What changes were proposed in this pull request?

Minor typo fix.

## How was this patch tested?

Existing unit tests on Jenkins.

Author: Andrew Ash <andrew@andrewash.com>
Closes #15486 from ash211/patch-8.

* [DOC] Fix typo in sql hive doc (Dhruve Ashar, 2016-10-14, 1 file, -1/+1)

Change is too trivial to file a JIRA.

Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes #15485 from dhruve/master.

* [SPARK-17073][SQL][FOLLOWUP] generate column-level statistics (wangzhenhua, 2016-10-14, 3 files, -57/+197)

## What changes were proposed in this pull request?

This PR adds some test cases for statistics: case-sensitive column names, non-ASCII column names, refresh table, and also improves some documentation.

## How was this patch tested?

Added test cases.

Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #15360 from wzhfy/colStats2.

* [SPARK-17855][CORE] Remove query string from jar url (invkrh, 2016-10-14, 1 file, -15/+9)

## What changes were proposed in this pull request?

Spark-submit supports jar URLs with the http protocol. However, if the URL contains a query string, the `worker.DriverRunner.downloadUserJar()` method throws a "Did not see expected jar" exception. This is because the method checks for the existence of a downloaded jar whose name contains the query string. This is a problem when your jar is located on a web service which requires some additional information to retrieve the file.

This PR just removes the query string before checking jar existence on the worker.

## How was this patch tested?

For now, you can only test this patch manually:
* Deploy a Spark cluster locally
* Make sure the Apache httpd service is on
* Save an uber jar, e.g. spark-job.jar, under `/var/www/html/`
* Use http://localhost/spark-job.jar?param=1 as the jar URL when running `spark-submit`
* The job should be launched

Author: invkrh <invkrh@gmail.com>
Closes #15420 from invkrh/spark-17855.

* [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference (Peng, 2016-10-14, 5 files, -15/+15)

## What changes were proposed in this pull request?

The feature selection method ChiSquareSelector is based on ChiSquareTestResult.statistic (the chi-square value) to select features: it selects the features with the largest chi-square values. But the degree of freedom (df) of the chi-square value differs across features in Statistics.chiSqTest(RDD), and with different df you cannot compare chi-square values to select features. So we change statistic to pValue for SelectKBest and SelectPercentile.

## How was this patch tested?

Changed existing tests.

Author: Peng <peng.meng@intel.com>
Closes #15444 from mpjlu/chisqure-bug.

* [SPARK-14634][ML] Add BisectingKMeansSummary (Zheng RuiFeng, 2016-10-14, 4 files, -5/+91)

## What changes were proposed in this pull request?

Add BisectingKMeansSummary.

## How was this patch tested?

Unit tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12394 from zhengruifeng/biKMSummary.

* [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load (Yanbo Liang, 2016-10-14, 1 file, -9/+36)

## What changes were proposed in this pull request?

Since ```ml.evaluation``` already supports save/load on the Scala side, supporting it on the Python side is very straightforward and easy.

## How was this patch tested?

Added Python doctests.

Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13194 from yanboliang/spark-15402.

* [SPARK-17903][SQL] MetastoreRelation should talk to external catalog instead of hive client (Wenchen Fan, 2016-10-14, 11 files, -32/+66)

## What changes were proposed in this pull request?

`HiveExternalCatalog` should be the only interface to talk to the hive metastore. In `MetastoreRelation` we can just use `ExternalCatalog` instead of `HiveClient` to interact with the hive metastore, and add the missing APIs to `ExternalCatalog`.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #15460 from cloud-fan/relation.

* [SPARK-17925][SQL] Break fileSourceInterfaces.scala into multiple pieces (Reynold Xin, 2016-10-14, 5 files, -284/+359)

## What changes were proposed in this pull request?

This patch does a few changes to the file structure of data sources:

- Break fileSourceInterfaces.scala into multiple pieces (HadoopFsRelation, FileFormat, OutputWriter)
- Move ParquetOutputWriter into its own file

I created this as a separate patch so it'd be easier to review my future PRs that focus on refactoring this internal logic. This patch only moves code around, and has no logic changes.

## How was this patch tested?

N/A - should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>
Closes #15473 from rxin/SPARK-17925.

* [SPARK-17927][SQL] Remove dead code in WriterContainer. (Reynold Xin, 2016-10-14, 1 file, -11/+0)

## What changes were proposed in this pull request?

speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.

## How was this patch tested?

Tests should fail if they are not dead code.

Author: Reynold Xin <rxin@databricks.com>
Closes #15477 from rxin/SPARK-17927.

* [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel. (Yanbo Liang, 2016-10-13, 2 files, -4/+43)

## What changes were proposed in this pull request?

Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?

Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15430 from yanboliang/spark-15957-python.

* [SPARK-17368][SQL] Add support for value class serialization and deserialization (Jakob Odersky, 2016-10-13, 2 files, -3/+13)

## What changes were proposed in this pull request?

Value classes were unsupported because catalyst data types were obtained through reflection on erased types, which would resolve to a value class' wrapped type and hence lead to unavailable methods during code generation.

E.g. the following class
```scala
case class Foo(x: Int) extends AnyVal
```
would be seen as an `int` in catalyst and will cause instance cast failures when generated java code tries to treat it as a `Foo`.

This patch simply removes the erasure step when getting data types for catalyst.

## How was this patch tested?

Additional tests in `ExpressionEncoderSuite`.

Author: Jakob Odersky <jakob@odersky.com>
Closes #15284 from jodersky/value-classes.

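A sketch of the kind of usage this is meant to unblock, reusing `Foo` from the description. The PR text does not say whether top-level value classes or only value-class fields are covered, so the sketch nests the value class inside an ordinary case class and should be treated as illustrative only:

```scala
// Value class from the description plus a regular case class using it as a field.
case class Foo(x: Int) extends AnyVal
case class Record(id: Long, foo: Foo)

import spark.implicits._

// Encoding data that involves a value class; generated code should now treat
// the field as a Foo instead of failing with a cast error on the bare int.
val ds = Seq(Record(1L, Foo(10)), Record(2L, Foo(20))).toDS()
ds.map(r => r.foo.x + 1).collect()
```
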
* [SPARK-17661][SQL] Consolidate various listLeafFiles implementations (petermaxlee, 2016-10-13, 4 files, -224/+206)

## What changes were proposed in this pull request?

There are 4 listLeafFiles-related functions in Spark:

- ListingFileCatalog.listLeafFiles (which calls HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is greater than a threshold; if it is lower, then it has its own serial version implemented)
- HadoopFsRelation.listLeafFiles (called only by HadoopFsRelation.listLeafFilesInParallel)
- HadoopFsRelation.listLeafFilesInParallel (called only by ListingFileCatalog.listLeafFiles)

It is actually very confusing and error prone because there are effectively two distinct implementations for the serial version of listing leaf files. As an example, SPARK-17599 updated only one of the code paths and ignored the other one.

This code can be improved by:

- Move all file listing code into ListingFileCatalog, since it is the only class that needs this.
- Keep only one function for listing files in serial.

## How was this patch tested?

This change should be covered by existing unit and integration tests. I also moved a test case for HadoopFsRelation.shouldFilterOut from HadoopFsRelationSuite to ListingFileCatalogSuite.

Author: petermaxlee <petermaxlee@gmail.com>
Closes #15235 from petermaxlee/SPARK-17661.

* [SPARK-17731][SQL][STREAMING] Metrics for structured streaming (Tathagata Das, 2016-10-13, 27 files, -273/+1758)

## What changes were proposed in this pull request?

Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics: https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing

Specifically, this PR adds the following public API changes.

### New APIs
- `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)
- `StreamingQueryStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
  - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
  - ~~outputRate~~ - *Does not work with wholestage codegen*
  - latency - Current average latency between the data being available in source and the sink writing the corresponding output
  - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
  - sinkStatus: SinkStatus - Current status of the sink
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    - latencies - getOffset, getBatch, full trigger, wal writes
    - timestamps - trigger start, finish, after getOffset, after getBatch
    - numRows - input, output, state total/updated rows for aggregations
- `SourceStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by the source
  - processingRate - Current rate (rows/sec) at which the query is processing data from the source
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger
- Python API for `StreamingQuery.status()`

### Breaking changes to existing APIs

**Existing direct public facing APIs**
- Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
  - Branch 2.0 should have it deprecated, master should have it removed.

**Existing advanced listener APIs**
- `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
  - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in the direct public-facing API (StreamingQuery.status)
- Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed to the name `queryStatus` and return type `StreamingQueryStatus`.
- Field `offsetDesc` in `SourceStatus` was Option[String], converted it to `String`.
- For `SourceStatus` and `SinkStatus` made the constructor private instead of private[sql] to make them more java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java.

## How was this patch tested?

Old and new unit tests.
- Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
- New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
- New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
- Source-specific tests for making sure input rows are counted are in source-specific test suites.
- Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.

Metrics were also manually tested using the Ganglia sink.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #15307 from tdas/SPARK-17731.

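A sketch of polling the status API described above, using the field names as they appear in this PR (later Spark releases reworked this surface into StreamingQueryProgress, so treat the member names as a snapshot of this change, and the socket source as an arbitrary example input):

```scala
// Assumed: an active SparkSession `spark` and a socket server on localhost:9999.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")
  .start()

val status = query.status                          // StreamingQueryStatus
println(status.inputRate)                          // rows/sec across all sources
println(status.processingRate)
status.sourceStatuses.foreach(s => println(s.processingRate))
println(status.sinkStatus)
```
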
* [SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer (Shixiong Zhu, 2016-10-13, 2 files, -26/+48)

## What changes were proposed in this pull request?

Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR just calls `seekToBeginning` to manually set the earliest offsets for the KafkaSource initial offsets.

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #15397 from zsxwing/SPARK-17834.

* [SPARK-17827][SQL] maxColLength type should be Int for String and Binary (Pete Robbins, 2016-10-13, 3 files, -7/+7)

## What changes were proposed in this pull request?

Correct the expected type from the Length function to be Int.

## How was this patch tested?

Test runs on little-endian and big-endian platforms.

Author: Pete Robbins <robbinspg@gmail.com>
Closes #15464 from robbinspg/SPARK-17827.

* [SPARK-17830][SQL] Annotate remaining SQL APIs with InterfaceStability (Reynold Xin, 2016-10-13, 35 files, -122/+150)

## What changes were proposed in this pull request?

This patch annotates all the remaining APIs in SQL (excluding streaming) with InterfaceStability.

## How was this patch tested?

N/A - just annotation change.

Author: Reynold Xin <rxin@databricks.com>
Closes #15457 from rxin/SPARK-17830-2.

* [SPARK-17657][SQL] Disallow Users to Change Table Type (gatorsmile, 2016-10-13, 2 files, -0/+37)

### What changes were proposed in this pull request?

Hive allows users to change the table type from `Managed` to `External` or from `External` to `Managed` by altering the table's property `EXTERNAL`. See the JIRA: https://issues.apache.org/jira/browse/HIVE-1329

So far, Spark SQL does not correctly support it, although users can do it. Many assumptions are broken in the implementation. Thus, this PR disallows users from changing it. In addition, we also do not allow users to set the property `EXTERNAL` when creating a table.

### How was this patch tested?

Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Closes #15230 from gatorsmile/alterTableSetExternal.

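The statement shapes that are now rejected, sketched via `spark.sql` (table names are illustrative and an active Hive-enabled `spark` session is assumed):

```scala
// Flipping the table type after creation is disallowed:
spark.sql("ALTER TABLE src SET TBLPROPERTIES ('EXTERNAL' = 'TRUE')")

// ...and so is setting the EXTERNAL property at creation time:
spark.sql("CREATE TABLE src2 (id INT) TBLPROPERTIES ('EXTERNAL' = 'TRUE')")
```
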
* [SPARK-17686][CORE] Support printing out scala and java version with spark-submit --version command (jerryshao, 2016-10-13, 1 file, -1/+3)

## What changes were proposed in this pull request?

In our universal gateway service we need to specify different jars for Spark according to the Scala version. For now, only after launching a Spark application can we know which version of Scala it depends on. This makes it hard for us to support different Scala + Spark versions and pick the right jars.

So here we propose to print out the Scala version alongside the Spark version in "spark-submit --version", so that users can leverage this output to make the choice without needing to launch an application.

## How was this patch tested?

Manually verified in a local environment.

Author: jerryshao <sshao@hortonworks.com>
Closes #15456 from jerryshao/SPARK-17686.

* [SPARK-17899][SQL] add a debug mode to keep raw table properties in HiveExternalCatalog (Wenchen Fan, 2016-10-13, 4 files, -13/+42)

## What changes were proposed in this pull request?

Currently `HiveExternalCatalog` filters out the Spark SQL internal table properties, e.g. `spark.sql.sources.provider`, `spark.sql.sources.schema`, etc. This is reasonable for external users as they don't want to see these internal properties in `DESC TABLE`. However, as Spark developers, sometimes we do want to see the raw table properties. This PR adds a new internal SQL conf, `spark.sql.debug`, to enable debug mode and keep these raw table properties. This config can also be used in similar places where we want to retain debug information in the future.

## How was this patch tested?

New test in MetastoreDataSourcesSuite.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #15458 from cloud-fan/debug.

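A sketch of how a developer might flip the debug flag; whether it has to be set before the session is created (rather than via `spark.conf.set` at runtime) is an assumption here, and `some_table` is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.debug", "true")   // keep raw spark.sql.sources.* table properties
  .getOrCreate()

// With debug mode on, the internal properties should survive into DESC output.
spark.sql("DESC FORMATTED some_table").show(100, false)
```
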
* [SPARK-11272][WEB UI] Add support for downloading event logs from HistoryServer UI (Alex Bozarth, 2016-10-13, 2 files, -1/+7)

## What changes were proposed in this pull request?

This is a reworked PR based on feedback in #9238 after it was closed and not reopened. As suggested in that PR I've only added the download feature. This functionality already exists in the api and this allows easier access to download event logs to share with others.

I've attached a screenshot of the committed version, but I will also include alternate options with screen shots in the comments below. I'm personally not sure which option is best.

## How was this patch tested?

Manual testing

![screen shot 2016-10-07 at 6 11 12 pm](https://cloud.githubusercontent.com/assets/13952758/19209213/832fe48e-8cba-11e6-9840-749b1be4d399.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #15400 from ajbozarth/spark11272.

* minor doc fix for Row.scala (buzhihuojie, 2016-10-12, 1 file, -1/+1)

## What changes were proposed in this pull request?

Minor doc fix for "getAnyValAs" in class Row.

## How was this patch tested?

None.

Author: buzhihuojie <ren.weiluo@gmail.com>
Closes #15452 from david-weiluo-ren/minorDocFixForRow.

* [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropDuplicates (Liang-Chi Hsieh, 2016-10-13, 2 files, -4/+29)

## What changes were proposed in this pull request?

Two issues regarding Dataset.dropDuplicates:

1. Dataset.dropDuplicates should consider columns with the same column name. We find and get the first resolved attribute from the output with the given column name in `Dataset.dropDuplicates`. When there is more than one column with the same name, the other columns are put into the aggregation columns instead of the grouping columns.

2. Dataset.dropDuplicates should not change the output of the child plan. We currently create a new `Alias` with a new exprId in `Dataset.dropDuplicates`. However, this causes a problem when we want to select the columns as follows:

    val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
    // ds("_2") will cause an analysis exception
    ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])

Because the two issues are both related to `Dataset.dropDuplicates` and the code changes are not big, they are submitted together as one PR.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #15427 from viirya/fix-dropduplicates.

* [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once (Burak Yavuz, 2016-10-12, 3 files, -27/+38)

## What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch. You may come across stacktraces that look like:

```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
```

The safer way is to write to an output stream so that we don't have to materialize a huge string.

## How was this patch tested?

Existing unit tests.

Author: Burak Yavuz <brkyvz@gmail.com>
Closes #15437 from brkyvz/ser-to-stream.

* [SPARK-17835][ML][MLLIB] Optimize NaiveBayes mllib wrapper to eliminate extra pass on data (Yanbo Liang, 2016-10-12, 2 files, -18/+43)

## What changes were proposed in this pull request?

[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, there are some differences between mllib and ml in how labels are handled:
* mllib allows input labels as {-1, +1}; however, ml assumes the input labels are in the range [0, numClasses).
* mllib ```NaiveBayesModel``` exposes ```labels```, but ml does not, due to the assumption mentioned above.

During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used ```val labels = data.map(_.label).distinct().collect().sorted``` to get the distinct labels first, and then encoded the labels for training. This involves an extra Spark job compared with the original implementation. Since ```NaiveBayes``` only does one pass of aggregation during training, adding another one seems less efficient. We can get the labels in a single pass along with ```NaiveBayes``` training and send them to the MLlib side.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Closes #15402 from yanboliang/spark-17835.

* [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter (WeichenXu, 2016-10-12, 1 file, -13/+13)

## What changes were proposed in this pull request?

Update the Python API for NaiveBayes: add a weight col parameter.

## How was this patch tested?

Doctests added.

Author: WeichenXu <WeichenXu123@outlook.com>
Closes #15406 from WeichenXu123/nb_python_update.

* [SPARK-17845] [SQL] More self-evident window function frame boundary API (Reynold Xin, 2016-10-12, 5 files, -51/+160)

## What changes were proposed in this pull request?

This patch improves the window function frame boundary API to make it more obvious to read and to use. The two high level changes are:

1. Create Window.currentRow, Window.unboundedPreceding, Window.unboundedFollowing to indicate the special values in frame boundaries. These methods map to the special integral values so we are not breaking backward compatibility here. This change makes the frame boundaries more self-evident (instead of Long.MinValue, it becomes Window.unboundedPreceding).

2. In Python, for any value less than or equal to JVM's Long.MinValue, treat it as Window.unboundedPreceding. For any value larger than or equal to JVM's Long.MaxValue, treat it as Window.unboundedFollowing. Before this change, if the user specifies any value that is less than Long.MinValue but not -sys.maxsize (e.g. -sys.maxsize + 1), the number we pass over to the JVM would overflow, resulting in a frame that does not make sense.

Code example required to specify a frame before this patch:
```
Window.rowsBetween(-Long.MinValue, 0)
```

While the above code should still work, the new way is more obvious to read:
```
Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
```

## How was this patch tested?

- Updated DataFrameWindowSuite (for Scala/Java)
- Updated test_window_functions_cumulative_sum (for Python)
- Renamed DataFrameWindowSuite to DataFrameWindowFunctionsSuite to better reflect its purpose

Author: Reynold Xin <rxin@databricks.com>
Closes #15438 from rxin/SPARK-17845.

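A slightly fuller sketch of the new boundary constants in Scala, assuming a DataFrame `df` with columns `key`, `ts`, and `value` (the DataFrame itself is not part of this patch):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total per key over all rows up to the current one.
val w = Window
  .partitionBy("key")
  .orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val running = df.withColumn("running_total", sum("value").over(w))
```
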
* [SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of poll twice (cody koeninger, 2016-10-12, 3 files, -7/+32)

## What changes were proposed in this pull request?

Alternative approach to https://github.com/apache/spark/pull/15387

Author: cody koeninger <cody@koeninger.org>
Closes #15401 from koeninger/SPARK-17782-alt.

* [SPARK-17675][CORE] Expand Blacklist for TaskSets (Imran Rashid, 2016-10-12, 17 files, -198/+964)

## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425. To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making Spark fault-tolerant of one bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (e.g. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).

## How was this patch tested?

Added unit tests, ran tests via Jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>
Closes #15249 from squito/taskset_blacklist_only.

* [SPARK-17850][CORE] Add a flag to ignore corrupt files (Shixiong Zhu, 2016-10-12, 7 files, -7/+153)

## What changes were proposed in this pull request?

Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.

## How was this patch tested?

The added unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #15422 from zsxwing/SPARK-17850.

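Both switches are plain configuration flags; a sketch of enabling them at session construction (setting the SQL flag at runtime via `spark.conf.set` may also work, but that is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.files.ignoreCorruptFiles", "true")      // Spark core reads
  .config("spark.sql.files.ignoreCorruptFiles", "true")  // Spark SQL file sources
  .getOrCreate()
```
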
* [BUILD] Closing stale PRs (Sean Owen, 2016-10-12, 0 files, -0/+0)

Closes #15303
Closes #15078
Closes #15080
Closes #15135
Closes #14565
Closes #12355
Closes #15404

Author: Sean Owen <sowen@cloudera.com>
Closes #15451 from srowen/CloseStalePRs.

* [SPARK-17840][DOCS] Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE (Sean Owen, 2016-10-12, 2 files, -3/+6)

## What changes were proposed in this pull request?

Link to contributing wiki in PR template, README.md

## How was this patch tested?

Doc-only change, tested by Jekyll

Author: Sean Owen <sowen@cloudera.com>
Closes #15429 from srowen/SPARK-17840.

* [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB (Hossein, 2016-10-12, 4 files, -3/+68)

## What changes were proposed in this pull request?

If the R data structure that is being parallelized is larger than `INT_MAX` we use files to transfer data to JVM. The serialization protocol mimics Python pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.

I tested this on my MacBook. Following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?

* [x] Unit tests

Author: Hossein <hossein@databricks.com>
Closes #15375 from falaki/SPARK-17790.

* [SPARK-17884][SQL] To resolve Null pointer exception when casting from empty string to interval type. (prigarg, 2016-10-12, 2 files, -1/+7)

## What changes were proposed in this pull request?

This change adds a check in the castToInterval method of the Cast expression, such that if the converted value is null, the isNull variable is set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) was throwing a NullPointerException because of the above mentioned reason.

## How was this patch tested?

Added a test case in CastSuite.scala.

JIRA entry for detail: https://issues.apache.org/jira/browse/SPARK-17884

Author: prigarg <prigarg@adobe.com>
Closes #15449 from priyankagargnitk/SPARK-17884.

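The failing expression named in the description, sketched as a catalyst-level check. These are internal APIs and the exact test in CastSuite may differ, so treat this as an illustration rather than the committed test:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.CalendarIntervalType

// Previously threw a NullPointerException; with the fix, casting an empty
// string to an interval evaluates to null.
val result = Cast(Literal(""), CalendarIntervalType).eval(null)
assert(result == null)
```
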
* [SPARK-14761][SQL] Reject invalid join methods when join columns are not specified in PySpark DataFrame join. (Bijay Pathak, 2016-10-12, 2 files, -16/+21)

## What changes were proposed in this pull request?

In PySpark, an invalid join type does not throw an error for the following join:
```df1.join(df2, how='not-a-valid-join-type')```

The signature of the join is:
```def join(self, other, on=None, how=None):```

The existing code completely ignores the `how` parameter when `on` is `None`. This patch processes the arguments passed to join and passes them to the JVM Spark SQL Analyzer, which validates the join type passed.

## How was this patch tested?

Used manual and existing test suites.

Author: Bijay Pathak <bkpathak@mtu.edu>
Closes #15409 from bkpathak/SPARK-14761.

* [SPARK-11560][MLLIB] Optimize KMeans implementation / remove 'runs' (Sean Owen, 2016-10-12, 1 file, -164/+132)

## What changes were proposed in this pull request?

This is a revival of https://github.com/apache/spark/pull/14948 and related to https://github.com/apache/spark/pull/14937. This removes the 'runs' parameter, which has already been disabled, from the K-means implementation and further deprecates API methods that involve it.

This also happens to resolve the issue that K-means should not return duplicate centers, meaning that it may return less than k centroids if not enough data is available.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>
Closes #15342 from srowen/SPARK-11560.

* [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad (cody koeninger, 2016-10-12, 1 file, -2/+5)

## What changes were proposed in this pull request?

Documentation fix to make it clear that reusing a group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.

## How was this patch tested?

I built the Jekyll doc and made sure it looked ok.

Author: cody koeninger <cody@koeninger.org>
Closes #15442 from koeninger/SPARK-17853.

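A sketch of the pattern the doc change recommends with the 0.10 direct stream: give each stream its own `group.id` instead of sharing one. The map keys are standard Kafka consumer configs; the broker address, topic, and the surrounding `ssc` StreamingContext are assumed:

```scala
import java.util.UUID
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

def kafkaParams(streamName: String): Map[String, Object] = Map(
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  // unique per stream, as the documentation change advises
  "group.id"           -> s"$streamName-${UUID.randomUUID()}"
)

val streamA = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topicA"), kafkaParams("stream-a")))
```
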
* [SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is incorrect. (Kousuke Saruta, 2016-10-11, 1 file, -1/+1)

## What changes were proposed in this pull request?

In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.

## How was this patch tested?

Manual test.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #15439 from sarutak/SPARK-17880.

* Fix hadoop.version in building-spark.md (Alexander Pivovarov, 2016-10-11, 1 file, -2/+2)

A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number.

Author: Alexander Pivovarov <apivovarov@gmail.com>
Closes #15440 from apivovarov/patch-1.