* [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature
  Nick Pentreath | 2016-06-21 | 28 files, -68/+362
  This PR adds missing `Since` annotations to `ml.feature` package. Closes #8505.
  ## How was this patch tested?
  Existing tests.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13641 from MLnick/add-since-annotations.
* Revert "[SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)"Xiangrui Meng2016-06-212-8/+6
| | | | This reverts commit a46553cbacf0e4012df89fe55385dec5beaa680a.
* [SPARK-15319][SPARKR][DOCS] Fix SparkR doc layout for corr and other DataFrame stats functions
  Felix Cheung | 2016-06-21 | 2 files, -23/+17
  ## What changes were proposed in this pull request?
  Doc only changes. Please see screenshots.
  Before: http://spark.apache.org/docs/latest/api/R/statfunctions.html
  ![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)
  After:
  ![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
  (Please ignore the style differences; this is due to not having the CSS in my local copy.)
  This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats functions instead of putting everything on one page. At least now it is clearer which description belongs to which function.
  ## How was this patch tested?
  Build doc.
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #13109 from felixcheung/rstatdoc.
* [SPARKR][DOCS] R code doc cleanup
  Felix Cheung | 2016-06-20 | 8 files, -84/+70
  ## What changes were proposed in this pull request?
  I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc. There are still more doc issues to be cleaned up.
  ## How was this patch tested?
  Manual tests.
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13798 from felixcheung/rdocseealso.
* [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
  Takeshi YAMAMURO | 2016-06-21 | 1 file, -0/+17
  ## What changes were proposed in this pull request?
  Update docs for two parameters, `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, in Other Configuration Options.
  ## How was this patch tested?
  N/A
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #13797 from maropu/SPARK-15894-2.
* [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
  Felix Cheung | 2016-06-21 | 2 files, -19/+17
  ## What changes were proposed in this pull request?
  Update doc as per discussion in PR #13592.
  ## How was this patch tested?
  Manual. shivaram liancheng
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13799 from felixcheung/rsqlprogrammingguide.
* [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
  Eric Liang | 2016-06-20 | 1 file, -0/+5
  This has changed from 1.6, and now stores memory off-heap using Spark's off-heap support instead of in Tachyon.
  Author: Eric Liang <ekl@databricks.com>
  Closes #13744 from ericl/spark-16025.
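  A minimal Scala sketch (not part of the original commit message) of persisting a Dataset at the documented OFF_HEAP level; it assumes a SparkSession named `spark` started with off-heap memory enabled:
  ```scala
  import org.apache.spark.storage.StorageLevel

  // Assumes the application was started with, for example:
  //   --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=1g
  val df = spark.range(0, 1000000).toDF("id")
  df.persist(StorageLevel.OFF_HEAP)  // stored via Spark's own off-heap support, not Tachyon
  df.count()                         // materializes the cached data
  ```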
* [SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD
  hyukjinkwon | 2016-06-20 | 3 files, -3/+40
  ## What changes were proposed in this pull request?
  This PR makes the `input_file_name()` function return the file paths rather than empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).
  With these external data sources, the code below:
  ```scala
  df.select(input_file_name).show()
  ```
  will produce:
  - **Before**
  ```
  +-----------------+
  |input_file_name()|
  +-----------------+
  |                 |
  +-----------------+
  ```
  - **After**
  ```
  +--------------------+
  |   input_file_name()|
  +--------------------+
  |file:/private/var...|
  +--------------------+
  ```
  ## How was this patch tested?
  Unit tests in `ColumnExpressionSuite`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #13759 from HyukjinKwon/SPARK-16044.
* [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
  Xiangrui Meng | 2016-06-20 | 3 files, -0/+93
  ## What changes were proposed in this pull request?
  Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
  ## How was this patch tested?
  Unit tests in Scala and Java.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #13789 from mengxr/SPARK-16074.
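  A hedged Scala sketch of how a custom transformer's `transformSchema` could use the newly public vector/matrix data types; it assumes they are exposed through `org.apache.spark.ml.linalg.SQLDataTypes`, and the column names are hypothetical:
  ```scala
  import org.apache.spark.ml.linalg.SQLDataTypes
  import org.apache.spark.sql.types.{StructField, StructType}

  // Validate the input column type and declare the output column without reflection.
  def transformSchema(schema: StructType): StructType = {
    require(schema("features").dataType == SQLDataTypes.VectorType,
      "Column 'features' must be a vector column")
    StructType(schema.fields :+ StructField("projected", SQLDataTypes.MatrixType, nullable = false))
  }
  ```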
* [SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source
  gatorsmile | 2016-06-20 | 2 files, -15/+98
  #### What changes were proposed in this pull request?
  This PR is to fix the following bugs:
  **Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 4,
    upperBound = 0,
    numPartitions = 3,
    connectionProperties = new Properties)
  ```
  **Before code changes:** The returned results are wrong and the generated partitions are wrong:
  ```
  Part 0 id < 3 or id is null
  Part 1 id >= 3 AND id < 2
  Part 2 id >= 2
  ```
  **After code changes:** Issue an `IllegalArgumentException` exception:
  ```
  Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
  ```
  **Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 1,
    upperBound = 5,
    numPartitions = 10,
    connectionProperties = new Properties)
  ```
  **Before code changes:** Returned correct results but the generated partitions are very inefficient, like:
  ```
  Partition 0: id < 1 or id is null
  Partition 1: id >= 1 AND id < 1
  Partition 2: id >= 1 AND id < 1
  Partition 3: id >= 1 AND id < 1
  Partition 4: id >= 1 AND id < 1
  Partition 5: id >= 1 AND id < 1
  Partition 6: id >= 1 AND id < 1
  Partition 7: id >= 1 AND id < 1
  Partition 8: id >= 1 AND id < 1
  Partition 9: id >= 1
  ```
  **After code changes:** Adjust `numPartitions` and can return the correct answers:
  ```
  Partition 0: id < 2 or id is null
  Partition 1: id >= 2 AND id < 3
  Partition 2: id >= 3 AND id < 4
  Partition 3: id >= 4
  ```
  **Issue 3: java.lang.ArithmeticException when numPartitions is zero**
  ```scala
  spark.read.jdbc(
    url = urlWithUserAndPass,
    table = "TEST.seq",
    columnName = "id",
    lowerBound = 0,
    upperBound = 4,
    numPartitions = 0,
    connectionProperties = new Properties)
  ```
  **Before code changes:** Got the following exception:
  ```
  java.lang.ArithmeticException: / by zero
  ```
  **After code changes:** Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero.
  #### How was this patch tested?
  Added test cases to verify the results.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13773 from gatorsmile/jdbcPartitioning.
* [SPARK-13792][SQL] Limit logging of bad records in CSV data source
  Reynold Xin | 2016-06-20 | 5 files, -15/+44
  ## What changes were proposed in this pull request?
  This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records. The error log looks something like:
  ```
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
  16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
  ```
  Closes #12173
  ## How was this patch tested?
  Manually tested.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13795 from rxin/SPARK-13792.
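  A minimal Scala sketch (assumed usage; the path and cap value are illustrative) of setting the new option when reading CSV:
  ```scala
  // Drop malformed rows, and log at most 10 of them per partition.
  val df = spark.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .option("maxMalformedLogPerPartition", "10")
    .csv("/path/to/data.csv")
  ```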
* [SPARK-15294][R] Add `pivot` to SparkR
  Dongjoon Hyun | 2016-06-20 | 4 files, -0/+73
  ## What changes were proposed in this pull request?
  This PR adds the `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295, mhnatiuk should be credited for the work he did.
  ## How was this patch tested?
  Pass the Jenkins tests (including a new testcase).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13786 from dongjoon-hyun/SPARK-15294.
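  For reference, a hedged Scala sketch of the Dataset `pivot` that the new SparkR function mirrors (column names are hypothetical):
  ```scala
  // Pivot course values into columns, summing earnings per year.
  val pivoted = df.groupBy("year").pivot("course").sum("earnings")
  ```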
* [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
  Davies Liu | 2016-06-20 | 2 files, -6/+8
  Fix the bug for Python UDFs that do not have any arguments. Added regression tests.
  Author: Davies Liu <davies.liu@gmail.com>
  Closes #13793 from davies/fix_no_arguments.
  (cherry picked from commit abe36c53d126bb580e408a45245fd8e81806869c)
  Signed-off-by: Davies Liu <davies.liu@gmail.com>
* remove duplicated docs in dapply
  Narine Kokhlikyan | 2016-06-20 | 2 files, -2/+4
  ## What changes were proposed in this pull request?
  Removed unnecessary duplicated documentation in dapply and dapplyCollect. In this pull request I created separate R docs for dapply and dapplyCollect: kept dapply's documentation separate from dapplyCollect's and referred from one to the other via a link.
  ## How was this patch tested?
  Existing test cases.
  Author: Narine Kokhlikyan <narine@slice.com>
  Closes #13790 from NarineK/dapply-docs-fix.
* [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
  Bryan Cutler | 2016-06-20 | 2 files, -2/+6
  ## What changes were proposed in this pull request?
  Fixed missing import for DecisionTreeRegressionModel used in the GBTClassificationModel trees method.
  ## How was this patch tested?
  Local tests.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
* [SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval"
  Kousuke Saruta | 2016-06-20 | 1 file, -1/+1
  ## What changes were proposed in this pull request?
  The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming, like spark.sql.streaming.stateStore.minDeltasForSnapshot.
  ## How was this patch tested?
  Existing unit tests.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
  Closes #13777 from sarutak/SPARK-16061.
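  A small sketch (assumed usage; the value is illustrative) showing that the renamed property is a SQL configuration and can be set on the session:
  ```scala
  spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
  ```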
* [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc
  Tathagata Das | 2016-06-20 | 3 files, -56/+420
  ## What changes were proposed in this pull request?
  Issues with the current reader behavior:
  - `text()` without args returns an empty DF with no columns. This is inconsistent; it is expected that `text` will always return a DF with a `value` string field.
  - `textFile()` without args fails with an exception because of the above reason; it expected the DF returned by `text()` to have a `value` field.
  - `orc()` does not have varargs, inconsistent with the others.
  - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
  - A user-specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
  The solution implemented here is the following:
  - For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, and text, this means adding json(string), etc. For orc, this means adding orc(varargs).
  - Remove the special handling of text(), csv(), etc. that returns an empty dataframe with no fields. Rather, pass on the empty sequence of paths to the data source, and let each data source handle it right. For example, the text data source should return an empty DF with schema (value: string).
  - Deduped docs and fixed their formatting.
  ## How was this patch tested?
  Added new unit tests for Scala and Java.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #13727 from tdas/SPARK-15982.
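  A hedged Scala sketch of the harmonized reader calls described above (paths are illustrative):
  ```scala
  val t1 = spark.read.text("logs/part-0.txt")                       // single-path overload
  val t2 = spark.read.text("logs/part-0.txt", "logs/part-1.txt")    // vararg overload
  val ds = spark.read.textFile("logs/part-0.txt")                   // Dataset[String] backed by a `value` column
  val o  = spark.read.orc("warehouse/t/day=1", "warehouse/t/day=2") // orc() now also accepts varargs
  ```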
* [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
  Cheng Lian | 2016-06-20 | 1 file, -288/+317
  ## What changes were proposed in this pull request?
  Initial SQL programming guide update for Spark 2.0. Contents like the 1.6 to 2.0 migration guide are still incomplete. We may also want to add more examples for Scala/Java Dataset typed transformations.
  ## How was this patch tested?
  N/A
  Author: Cheng Lian <lian@databricks.com>
  Closes #13592 from liancheng/sql-programming-guide-2.0.
* [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods
  Dongjoon Hyun | 2016-06-20 | 14 files, -34/+340
  ## What changes were proposed in this pull request?
  This PR adds `since` tags to Roxygen documentation according to the previous documentation archive: https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13734 from dongjoon-hyun/SPARK-14995.
* [MINOR] Closing stale pull requests.
  Sean Owen | 2016-06-20 | 0 files, -0/+0
  Closes #13114
  Closes #10187
  Closes #13432
  Closes #13550
  Author: Sean Owen <sowen@cloudera.com>
  Closes #13781 from srowen/CloseStalePR.
* [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
  Felix Cheung | 2016-06-20 | 9 files, -239/+162
  ## What changes were proposed in this pull request?
  roxygen2 doc, programming guide, example updates
  ## How was this patch tested?
  Manual checks. shivaram
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13751 from felixcheung/rsparksessiondoc.
* [SPARK-16053][R] Add `spark_partition_id` in SparkR
  Dongjoon Hyun | 2016-06-20 | 4 files, -0/+27
  ## What changes were proposed in this pull request?
  This PR adds the `spark_partition_id` virtual column function to SparkR for API parity. The following is just an example to illustrate SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
  ```r
  > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
     id SPARK_PARTITION_ID()
  1   3                    0
  2   4                    0
  3   8                    1
  4   9                    1
  5   0                    2
  6   1                    3
  7   2                    4
  8   5                    5
  9   6                    6
  10  7                    7
  ```
  ## How was this patch tested?
  Pass the Jenkins tests (including a new testcase).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13768 from dongjoon-hyun/SPARK-16053.
* [SPARKR] fix R roxygen2 doc for count on GroupedData
  Felix Cheung | 2016-06-20 | 1 file, -1/+3
  ## What changes were proposed in this pull request?
  Fix code doc.
  ## How was this patch tested?
  Manual. shivaram
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13782 from felixcheung/rcountdoc.
* [SPARK-16028][SPARKR] spark.lapply can work with active context
  Felix Cheung | 2016-06-20 | 2 files, -10/+16
  ## What changes were proposed in this pull request?
  spark.lapply and setLogLevel
  ## How was this patch tested?
  Unit tests. shivaram thunterdb
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13752 from felixcheung/rlapply.
* [SPARK-16051][R] Add `read.orc/write.orc` to SparkR
  Dongjoon Hyun | 2016-06-20 | 5 files, -1/+74
  ## What changes were proposed in this pull request?
  This issue adds `read.orc/write.orc` to SparkR for API parity.
  ## How was this patch tested?
  Pass the Jenkins tests (with new testcases).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13763 from dongjoon-hyun/SPARK-16051.
* [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable
  Felix Cheung | 2016-06-20 | 3 files, -13/+41
  ## What changes were proposed in this pull request?
  Add dropTempView and deprecate dropTempTable.
  ## How was this patch tested?
  Unit tests. shivaram liancheng
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13753 from felixcheung/rdroptempview.
* [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR
  Dongjoon Hyun | 2016-06-20 | 4 files, -1/+34
  ## What changes were proposed in this pull request?
  This PR adds the `monotonically_increasing_id` column function to SparkR for API parity. After this PR, SparkR supports the following.
  ```r
  > df <- read.json("examples/src/main/resources/people.json")
  > collect(select(df, monotonically_increasing_id(), df$name, df$age))
    monotonically_increasing_id()    name age
  1                              0 Michael  NA
  2                              1    Andy  30
  3                              2  Justin  19
  ```
  ## How was this patch tested?
  Pass the Jenkins tests (with an added testcase).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13774 from dongjoon-hyun/SPARK-16059.
* [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite
  Shixiong Zhu | 2016-06-20 | 1 file, -99/+0
  ## What changes were proposed in this pull request?
  ConsoleSinkSuite just collects content from stdout and compares it with the expected string. However, because Spark may not stop some background threads at once, there is a race condition where other threads output logs to stdout while ConsoleSinkSuite is running, which makes ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSink in the future, we should refactor ConsoleSink to make it testable instead of depending on stdout.
  ## How was this patch tested?
  Just removed a flaky test.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13776 from zsxwing/SPARK-16050.
* [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables
  Yin Huai | 2016-06-20 | 9 files, -25/+436
  ## What changes were proposed in this pull request?
  This PR adds static partition support to the INSERT statement when the target table is a data source table.
  ## How was this patch tested?
  New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
  **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13769 from yhuai/SPARK-16030-1.
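  A minimal sketch (table, column, and partition value are hypothetical) of the kind of statement this enables against an existing partitioned data source table:
  ```scala
  spark.sql(
    "INSERT INTO TABLE events PARTITION (ds = '2016-06-20') SELECT id, name FROM staging")
  ```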
* [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement
  Yin Huai | 2016-06-19 | 10 files, -61/+98
  ## What changes were proposed in this pull request?
  This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.
  ## How was this patch tested?
  Existing tests.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13766 from yhuai/caseSensitivity.
* [SPARK-16031] Add debug-only socket source in Structured Streaming
  Matei Zaharia | 2016-06-19 | 9 files, -0/+293
  ## What changes were proposed in this pull request?
  This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
  ## How was this patch tested?
  Unit tests and manual tests in spark-shell.
  Author: Matei Zaharia <matei@databricks.com>
  Closes #13748 from mateiz/socket-source.
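  A minimal Scala sketch of the debug-only source paired with the console sink (host and port are illustrative; feed it locally with `nc -lk 9999`):
  ```scala
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", "9999")
    .load()
  val query = lines.writeStream.format("console").start()
  ```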
* [SPARK-16040][MLLIB][DOC] spark.mllib PIC document extra line of reference
  wm624@hotmail.com | 2016-06-19 | 1 file, -4/+0
  ## What changes were proposed in this pull request?
  In the 2.0 document, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant. There is already "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.". We should remove the first line, which is consistent with other documents.
  ## How was this patch tested?
  Manual test.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #13755 from wangmiao1981/doc.
* [SPARK-15942][REPL] Unblock `:reset` command in REPL.
  Prashant Sharma | 2016-06-19 | 2 files, -3/+16
  ## What changes were proposed in this pull request?
  (Paste from the JIRA issue.) As a follow-up to SPARK-15697, I have the following semantics for the `:reset` command. On `:reset` we forget all that the user has done but not the initialization of Spark. To avoid confusion or make it more clear, we show a message that `spark` and `sc` are not erased; in fact they are in the same state as they were left by previous operations done by the user. While doing the above, somewhere I felt that this is not usually what reset means. But an accidental shutdown of a cluster can be very costly, so maybe in that sense this is less surprising and still useful.
  ## How was this patch tested?
  Manually, by calling the `:reset` command, both altering the state of SparkContext and creating some local variables.
  Author: Prashant Sharma <prashant@apache.org>
  Author: Prashant Sharma <prashsh1@in.ibm.com>
  Closes #13661 from ScrapCodes/repl-reset-command.
* [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time
  Davies Liu | 2016-06-19 | 4 files, -4/+129
  ## What changes were proposed in this pull request?
  Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR changes it to use a best-effort approximation of the posix timestamp to look up the offset. In the event of a DST change, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could lead to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice.
  ## How was this patch tested?
  Added regression tests.
  Author: Davies Liu <davies@databricks.com>
  Closes #13652 from davies/fix_timezone.
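  An illustrative (and hedged) way to observe the conversion this touches: casting a DATE to a TIMESTAMP is a days-to-millis conversion, and around a DST transition it could previously be off by one hour in some timezones:
  ```scala
  spark.sql("SELECT CAST(CAST('2016-03-13' AS DATE) AS TIMESTAMP)").show(truncate = false)
  ```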
* [SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
  Sean Zhong | 2016-06-18 | 3 files, -22/+50
  ## What changes were proposed in this pull request?
  `DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
  ## How was this patch tested?
  Unit test.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13749 from clockfly/SPARK-16034.
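  A hedged sketch of the case this check targets; `partitionBy()` must now match the partition columns of the existing table (table and column names are hypothetical):
  ```scala
  // OK if table `events` is partitioned by `ds`.
  df.write.mode("append").partitionBy("ds").saveAsTable("events")
  // Rejected after this change: mismatched partition columns.
  df.write.mode("append").partitionBy("id").saveAsTable("events")
  ```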
* [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems
  Wenchen Fan | 2016-06-18 | 12 files, -185/+104
  ## What changes were proposed in this pull request?
  The current table insertion has some weird behaviours:
  1. Inserting into a partitioned table with mismatched columns has a confusing error message for Hive tables, and a wrong result for data source tables.
  2. Inserting into a partitioned table without a partition list has a wrong result for Hive tables.
  This PR fixes these 2 problems.
  ## How was this patch tested?
  New test in hive `SQLQuerySuite`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #13754 from cloud-fan/insert2.
* [SPARK-15973][PYSPARK] Fix GroupedData Documentation
  Josh Howes | 2016-06-17 | 1 file, -11/+11
  *This contribution is my original work and I license the work to the project under the project's open source license.*
  ## What changes were proposed in this pull request?
  Documentation updates to PySpark's GroupedData.
  ## How was this patch tested?
  Manual tests.
  Author: Josh Howes <josh.howes@gmail.com>
  Author: Josh Howes <josh.howes@maxpoint.com>
  Closes #13724 from josh-howes/bugfix/SPARK-15973.
* [SPARK-16023][SQL] Move InMemoryRelation to its own file
  Andrew Or | 2016-06-17 | 2 files, -185/+211
  ## What changes were proposed in this pull request?
  Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
  ## How was this patch tested?
  Jenkins.
  Author: Andrew Or <andrew@databricks.com>
  Closes #13742 from andrewor14/move-inmemory-relation.
* [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession
  Jeff Zhang | 2016-06-17 | 1 file, -0/+16
  ## What changes were proposed in this pull request?
  Support `with` statement syntax for SparkSession in PySpark.
  ## How was this patch tested?
  Manually verified. Although I could add a unit test for it, it would affect other unit tests because the SparkContext is stopped after the with statement.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #13541 from zjffdu/SPARK-15803.
* [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis
  andreapasqua | 2016-06-17 | 1 file, -1/+1
  ## What changes were proposed in this pull request?
  The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that.
  ## How was this patch tested?
  Manual test.
  Author: andreapasqua <andrea@radius.com>
  Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix.
* [SPARK-16020][SQL] Fix complete mode aggregation with console sink
  Shixiong Zhu | 2016-06-17 | 3 files, -1/+105
  ## What changes were proposed in this pull request?
  We cannot use `limit` on a DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects the `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging.
  ## How was this patch tested?
  Manually confirmed ConsoleSink now works with complete mode aggregation.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13740 from zsxwing/complete-console.
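  A hedged Scala sketch of a complete-mode aggregation written to the console sink (the word-count shape and names are illustrative):
  ```scala
  val counts = lines.groupBy("value").count()
  val query = counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
  ```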
* [SPARK-15159][SPARKR] SparkR SparkSession API
  Felix Cheung | 2016-06-17 | 24 files, -186/+420
  ## What changes were proposed in this pull request?
  This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R but it's important to name this clearly.
  The SparkR implementation should follow these points:
  - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
  - SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
  - Changes to SparkSession are mostly transparent to users due to SPARK-10903
  - Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
  - Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
  - An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
  - Updating config on an existing SparkSession is supported; the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
  - Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
  - Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
  - The `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
  - All tests are updated to use the SparkSession entrypoint
  - A bug in `read.jdbc` is fixed
  TODO:
  - [x] Add more tests
  - [ ] Separate PR - update all roxygen2 doc coding examples
  - [ ] Separate PR - update SparkR programming guide
  ## How was this patch tested?
  Unit tests, manual tests. shivaram sun-rui rxin
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #13635 from felixcheung/rsparksession.
* [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)
  Xiangrui Meng | 2016-06-17 | 2 files, -0/+96
  ## What changes were proposed in this pull request?
  This PR implements Python wrappers for #13662 to convert old/new vector columns in a DataFrame.
  ## How was this patch tested?
  doctest in Python
  cc: yanboliang
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #13731 from mengxr/SPARK-15946.
* [SPARK-15129][R][DOC] R API changes in ML
  GayathriMurali | 2016-06-17 | 2 files, -60/+21
  ## What changes were proposed in this pull request?
  Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs.
  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13285 from GayathriMurali/SPARK-15129.
* [SPARK-16033][SQL] insertInto() can't be used together with partitionBy()
  Cheng Lian | 2016-06-17 | 2 files, -3/+46
  ## What changes were proposed in this pull request?
  When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have a wrong partitioning directory layout.
  ## How was this patch tested?
  New test case added in `InsertIntoHiveTableSuite`.
  Author: Cheng Lian <lian@databricks.com>
  Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.
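  A short sketch of the intended usage after this change (table and column names are hypothetical):
  ```scala
  // Correct: the existing table's catalog metadata determines the partitioning.
  df.write.mode("append").insertInto("partitioned_events")
  // Rejected now: combining partitionBy() with insertInto().
  df.write.partitionBy("ds").insertInto("partitioned_events")
  ```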
* [SPARK-15916][SQL] JDBC filter push down should respect operator precedence
  hyukjinkwon | 2016-06-17 | 2 files, -2/+28
  ## What changes were proposed in this pull request?
  This PR fixes the problem that the precedence order is messed up when pushing where-clause expressions to the JDBC layer.
  **Case 1:** For the sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may return fewer or more rows than expected.
  **Case 2:** For the sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using a where-clause:
  ```
  spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...)
  ```
  ## How was this patch tested?
  Unit test. This PR also closes #13640.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13743 from clockfly/SPARK-15916.
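  A hedged Scala sketch of Case 1: the filter below must be pushed down as "(a OR b) AND c", not "a OR (b AND c)" (URL, table, and column names are hypothetical):
  ```scala
  import java.util.Properties
  import spark.implicits._

  val people = spark.read.jdbc("jdbc:h2:mem:testdb", "TEST.people", new Properties)
  val filtered = people.filter(($"a" === 1 || $"b" === 2) && $"c" === 3)
  ```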
* [SPARK-16005][R] Add `randomSplit` to SparkR
  Dongjoon Hyun | 2016-06-17 | 4 files, -0/+60
  ## What changes were proposed in this pull request?
  This PR adds `randomSplit` to SparkR for API parity.
  ## How was this patch tested?
  Pass the Jenkins tests (with new testcase).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13721 from dongjoon-hyun/SPARK-16005.
* [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests
  Felix Cheung | 2016-06-17 | 4 files, -18/+57
  ## What changes were proposed in this pull request?
  Add registerTempTable to DataFrame and mark it deprecated.
  ## How was this patch tested?
  Unit tests. shivaram liancheng
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13722 from felixcheung/rregistertemptable.
* [SPARK-16014][SQL] Rename optimizer rules to be more consistent
  Reynold Xin | 2016-06-17 | 6 files, -22/+19
  ## What changes were proposed in this pull request?
  This small patch renames a few optimizer rules to make the naming more consistent, e.g. class names start with a verb. The main important "fix" is probably SamplePushDown -> PushProjectThroughSample: SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.
  ## How was this patch tested?
  Updated test cases.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13732 from rxin/SPARK-16014.
* [SPARK-16017][CORE] Send hostname from CoarseGrainedExecutorBackend to driver
  Shixiong Zhu | 2016-06-17 | 5 files, -11/+12
  ## What changes were proposed in this pull request?
  [SPARK-15395](https://issues.apache.org/jira/browse/SPARK-15395) changed how the driver gets the executor host: the driver now receives the executor's IP address instead of its host name. This PR just sends the hostname from executors to the driver so that the driver can pass it to the TaskScheduler.
  ## How was this patch tested?
  Existing unit tests.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13741 from zsxwing/SPARK-16017.