path: root/python
Commit message | Author | Age | Files | Lines
* [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests | Shixiong Zhu | 2016-06-14 | 3 | -20/+54
  ## What changes were proposed in this pull request?
  This PR just enables tests for sql/streaming.py and also fixes the failures.
  ## How was this patch tested?
  Existing unit tests.
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #13655 from zsxwing/python-streaming-test.
* [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions | Sandeep Singh | 2016-06-13 | 1 | -11/+1
  ## What changes were proposed in this pull request?
  SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful, because anytime it is run the result set contains over 100 built-in functions.
  ## How was this patch tested?
  CatalogSuite
  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13413 from techaddict/SPARK-15663.
* [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python | Liang-Chi Hsieh | 2016-06-13 | 14 | -16/+154
  ## What changes were proposed in this pull request?
  Now we have PySpark picklers for the new and old vector/matrix individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers for the new vector/matrix under `spark.ml.python` instead.
  ## How was this patch tested?
  Existing tests.
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13219 from viirya/pyspark-pickler-ml.
* [SPARK-15898][SQL] DataFrameReader.text should return DataFrame | Wenchen Fan | 2016-06-12 | 1 | -4/+4
  ## What changes were proposed in this pull request?
  We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].
  Affected PRs:
  https://github.com/apache/spark/pull/11731
  https://github.com/apache/spark/pull/13104
  https://github.com/apache/spark/pull/13184
  ## How was this patch tested?
  N/A
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #13604 from cloud-fan/revert.
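  A minimal PySpark sketch of the resulting behavior (illustrative, not from the commit; the file path is hypothetical): `read.text` keeps returning a DataFrame with a single string column named "value".
  ```python
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # DataFrame with one string column named "value", one row per line
  df = spark.read.text("/tmp/lines.txt")
  df.printSchema()  # root |-- value: string (nullable = true)
  ```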
* [SPARK-15840][SQL] Add two missing options in documentation and some option related changes | hyukjinkwon | 2016-06-11 | 1 | -13/+27
  ## What changes were proposed in this pull request?
  This PR
  1. Adds the documentation for some missing options, `inferSchema` and `mergeSchema`, for Python and Scala.
  2. Fixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that the class link is shown in the Python API docs (before: ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png), after: ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png); please refer to [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html)).
  3. Moves the `mergeSchema` option to `ParquetOptions`, removing the unused options `metastoreSchema` and `metastoreTableName`. They are not used anymore; they were removed in https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf and there are no use cases, as below:
  ```bash
  grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME .
  ```
  ```
  ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_SCHEMA = "metastoreSchema"
  ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_TABLE_NAME = "metastoreTableName"
  ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:  ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier(
  ```
  Only the last case sets `metastoreTableName`, and it does not use the table name.
  4. Sets the correct default values (in the documentation) for the `compression` option for ORC (`snappy`, see [OrcOptions.scala#L33-L42](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala#L33-L42)) and Parquet (the value specified in SQLConf, see [ParquetOptions.scala#L38-L47](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala#L38-L47)), and for `columnNameOfCorruptRecord` for JSON (the value specified in SQLConf, see [JsonFileFormat.scala#L53-L55](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L53-L55) and [JsonFileFormat.scala#L105-L106](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L105-L106)).
  ## How was this patch tested?
  Existing tests should cover this.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Hyukjin Kwon <gurwls223@gmail.com>
  Closes #13576 from HyukjinKwon/SPARK-15840.
* [SPARK-15585][SQL] Add doc for turning off quotations | Takeshi YAMAMURO | 2016-06-11 | 1 | -2/+4
  ## What changes were proposed in this pull request?
  This PR adds a doc for turning off quotations, because this behavior differs from `com.databricks.spark.csv`.
  ## How was this patch tested?
  Checked the behavior when putting an empty string in the CSV quote option.
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #13616 from maropu/SPARK-15585-2.
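  An illustrative sketch (not from the commit; the path is hypothetical, and `spark` is the session provided by the PySpark shell): per the doc added here, an empty string for the `quote` option turns quotation handling off, unlike `com.databricks.spark.csv`, which used null for this.
  ```python
  # Empty string turns quotation handling off when reading CSV
  df = spark.read.option("quote", "").csv("/tmp/data.csv")
  ```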
* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API | Bryan Cutler | 2016-06-10 | 1 | -0/+12
  ## What changes were proposed in this pull request?
  Adding __str__ to RFormula and its model, which will show the set formula param and the resolved formula. This is currently present in the Scala API and was found missing in PySpark during the Spark 2.0 coverage review.
  ## How was this patch tested?
  Ran pyspark-ml tests locally.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
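  A sketch of the added behavior (illustrative; the formula and exact output format are assumptions), run in the PySpark shell:
  ```python
  from pyspark.ml.feature import RFormula

  formula = RFormula(formula="clicked ~ country + hour")
  print(formula)  # shows the set formula param, e.g. RFormula(clicked ~ country + hour) (uid=...)
  ```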
* [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter | WeichenXu | 2016-06-10 | 1 | -5/+24
  ## What changes were proposed in this pull request?
  Add the max sentence length parameter to the Python Word2Vec API.
  ## How was this patch tested?
  Existing tests.
  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.
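  An illustrative sketch, assuming the Python param mirrors Scala's `maxSentenceLength` (the column names are made up); run with an active SparkSession:
  ```python
  from pyspark.ml.feature import Word2Vec

  # Sentences longer than maxSentenceLength words are divided into chunks
  word2vec = Word2Vec(vectorSize=100, minCount=5, maxSentenceLength=1000,
                      inputCol="text", outputCol="vectors")
  ```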
* [SPARK-15823][PYSPARK][ML] Add @property for 'accuracy' in MulticlassMetrics | Zheng RuiFeng | 2016-06-10 | 1 | -5/+2
  ## What changes were proposed in this pull request?
  `accuracy` should be decorated with `property` to keep step with other methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, `weightedRecall`, etc.
  ## How was this patch tested?
  Manual tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13560 from zhengruifeng/add_accuracy_property.
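  A minimal sketch of the resulting attribute-style access (illustrative; the toy predictions are made up), run in the PySpark shell where `sc` exists:
  ```python
  from pyspark.mllib.evaluation import MulticlassMetrics

  # RDD of (prediction, label) pairs
  prediction_and_labels = sc.parallelize([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0)])
  metrics = MulticlassMetrics(prediction_and_labels)

  # Attribute access, consistent with weightedPrecision / weightedRecall
  print(metrics.accuracy)
  ```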
* [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property | Jeff Zhang | 2016-06-09 | 1 | -0/+10
  ## What changes were proposed in this pull request?
  Add the `idf` property to IDFModel in PySpark.
  ## How was this patch tested?
  Added a unit test.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #13540 from zjffdu/SPARK-15788.
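  A minimal sketch (illustrative; the toy data is made up), assuming a running SparkSession `spark`:
  ```python
  from pyspark.ml.feature import IDF
  from pyspark.ml.linalg import Vectors

  df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),),
                              (Vectors.dense([0.0, 1.0]),)], ["tf"])
  model = IDF(inputCol="tf", outputCol="tfidf").fit(df)
  print(model.idf)  # vector of inverse document frequencies, as in the Scala API
  ```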
* [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precision,recall,f1 | Zheng RuiFeng | 2016-06-06 | 1 | -0/+18
  ## What changes were proposed in this pull request?
  1. Add accuracy to MulticlassMetrics.
  2. Deprecate the overall precision, recall and f1, and recommend using accuracy instead.
  ## How was this patch tested?
  Manual tests in the pyspark shell.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
* [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples | Yanbo Liang | 2016-06-06 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broke:
  ```python
  pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
  ```
  We should use ```accuracy``` to replace ```precision``` in these examples.
  ## How was this patch tested?
  Offline tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13519 from yanboliang/spark-15771.
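  A sketch of the fixed evaluator usage (illustrative, assuming a running SparkSession):
  ```python
  from pyspark.ml.evaluation import MulticlassClassificationEvaluator

  # metricName="precision" is no longer accepted; "accuracy" is the replacement
  evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
  ```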
* [MINOR] Fix Typos 'an -> a' | Zheng RuiFeng | 2016-06-06 | 9 | -12/+12
  ## What changes were proposed in this pull request?
  `an -> a`. Used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and reviewed them one by one.
  ## How was this patch tested?
  Manual tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13515 from zhengruifeng/an_a.
* Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"Reynold Xin2016-06-051-39/+42
| | | | This reverts commit b7e8d1cb3ce932ba4a784be59744af8a8ef027ce.
* [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour | Takeshi YAMAMURO | 2016-06-05 | 1 | -42/+39
  ## What changes were proposed in this pull request?
  This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv. It also explicitly sets default values for CSV options in Python.
  ## How was this patch tested?
  Added tests in CSVSuite.
  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
  Closes #13372 from maropu/SPARK-15585.
* [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score | Ruifeng Zheng | 2016-06-04 | 1 | -3/+1
  ## What changes were proposed in this pull request?
  1. Delete precision and recall in `ml.MulticlassClassificationEvaluator`.
  2. Update the user guide for `mllib.weightedFMeasure`.
  ## How was this patch tested?
  Local build.
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Closes #13390 from zhengruifeng/clarify_f1.
* [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier | Holden Karau | 2016-06-03 | 1 | -9/+66
  ## What changes were proposed in this pull request?
  MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params. Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169).
  ## How was this patch tested?
  Doc tests.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
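  An illustrative sketch of the new params (assuming they mirror the Scala names `stepSize`, `solver`, and `initialWeights`; the layer sizes are made up), run with an active SparkSession:
  ```python
  from pyspark.ml.classification import MultilayerPerceptronClassifier

  mlp = MultilayerPerceptronClassifier(layers=[4, 5, 3], maxIter=100,
                                       stepSize=0.03, solver="l-bfgs")
  ```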
* [SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods | Holden Karau | 2016-06-02 | 2 | -1/+67
  ## What changes were proposed in this pull request?
  Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`.
  ## How was this patch tested?
  Extended doc tests.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
* [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature | Yanbo Liang | 2016-06-01 | 1 | -16/+13
  ## What changes were proposed in this pull request?
  ML 2.0 QA: Scala APIs audit for ml.feature. This mainly includes:
  * Remove the seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
  * Scala API docs update.
  * Sync the Scala and Python API docs for these changes.
  ## How was this patch tested?
  Existing tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13410 from yanboliang/spark-15587.
* [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming | Reynold Xin | 2016-06-01 | 2 | -2/+3
  ## What changes were proposed in this pull request?
  This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotations to methods and classes that don't have them.
  ## How was this patch tested?
  Updated tests to reflect the moves.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13429 from rxin/SPARK-15686.
* [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structured Streaming | Tathagata Das | 2016-05-31 | 2 | -3/+24
  ## What changes were proposed in this pull request?
  Currently structured streaming only supports the append output mode. This PR adds the following.
  - Added support for Complete output mode in the internal state store, analyzer and planner.
  - Added public API in Scala and Python for users to specify the output mode.
  - Added checks for unsupported combinations of output mode and DF operations:
    - Plans with no aggregation should support only Append mode.
    - Plans with aggregation should support only Update and Complete modes.
    - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**).
  - Added support for Complete output mode in the Memory Sink. So the Memory Sink internally supports append and complete/update, but from the public API only Complete and Append output modes are supported.
  ## How was this patch tested?
  Unit tests in various test suites:
  - StreamingAggregationSuite: tests for complete mode
  - MemorySinkSuite: tests for checking behavior in Append and Complete modes
  - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
  - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
  - Python doc test and existing unit tests modified to call write.outputMode
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #13286 from tdas/complete-mode.
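  An illustrative sketch of specifying Complete mode from Python (names follow the final 2.0 `writeStream` API; at the time of this commit the setter lived on `df.write`, and `streaming_df` is a hypothetical streaming DataFrame):
  ```python
  counts = streaming_df.groupBy("word").count()

  query = (counts.writeStream
           .outputMode("complete")   # emit the full updated result table each trigger
           .format("memory")         # Memory Sink supports Complete mode after this change
           .queryName("counts")
           .start())
  ```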
* [MINOR][DOC][ML] ml.clustering scala & python api doc sync | Yanbo Liang | 2016-05-31 | 1 | -10/+25
  ## What changes were proposed in this pull request?
  Since we have done the Scala API audit for ml.clustering at #13148, we should also fix and update the corresponding Python API docs to keep them in sync.
  ## How was this patch tested?
  Docs change, no tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13291 from yanboliang/spark-15361-followup.
* Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers ↵Shixiong Zhu2016-05-311-3/+0
| | | | | | | | | | | | | | | | option work ## What changes were proposed in this pull request? This reverts commit c24b6b679c3efa053f7de19be73eb36dc70d9930. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`. ## How was this patch tested? Jenkins unit tests, integration tests, manual tests) Author: Shixiong Zhu <shixiong@databricks.com> Closes #13417 from zsxwing/revert-SPARK-11753.
* [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest | yinxusen | 2016-05-27 | 1 | -23/+46
  ## What changes were proposed in this pull request?
  1. Add `_transfer_param_map_to/from_java` for OneVsRest.
  2. Add `_compare_params` in ml/tests.py to help compare params.
  3. Add `test_onevsrest` as the integration test for OneVsRest.
  ## How was this patch tested?
  Python unit test.
  Author: yinxusen <yinxusen@gmail.com>
  Closes #12875 from yinxusen/SPARK-15008.
* [MINOR] Fix Typos 'a -> an' | Zheng RuiFeng | 2016-05-26 | 5 | -7/+7
  ## What changes were proposed in this pull request?
  `a` -> `an`. I used a regex to generate potential error lines, `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`, and reviewed them line by line.
  ## How was this patch tested?
  Local build; `lint-java` checking.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13317 from zhengruifeng/a_an.
* [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark | Eric Liang | 2016-05-26 | 1 | -1/+3
  ## What changes were proposed in this pull request?
  Also sets confs in the underlying sc when using SparkSession.builder.getOrCreate(). This is a bug-fix from a post-merge comment in https://github.com/apache/spark/pull/13289.
  ## How was this patch tested?
  Python doc-tests.
  Author: Eric Liang <ekl@databricks.com>
  Closes #13309 from ericl/spark-15520-1.
* [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV | Jurriaan Pruis | 2016-05-25 | 1 | -1/+6
  ## What changes were proposed in this pull request?
  Default the QuoteEscapingEnabled flag to true when writing CSV, and add an escapeQuotes option to be able to change this. See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
  This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2). https://issues.apache.org/jira/browse/SPARK-15493
  ## How was this patch tested?
  Added a test that verifies the output is quoted correctly.
  Author: Jurriaan Pruis <email@jurriaanpruis.nl>
  Closes #13267 from jurriaan/quote-escaping.
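  A sketch of the new writer option (illustrative; the output path is hypothetical and `df` is any DataFrame):
  ```python
  # Quote escaping is now on by default when writing CSV; it can be disabled:
  df.write.option("escapeQuotes", "false").csv("/tmp/out")
  ```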
* [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS | Nick Pentreath | 2016-05-25 | 1 | -4/+2
  Remove "Default: MEMORY_AND_DISK" from the `Param` doc field in the ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
  We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
  Tests: N/A.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
* [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions | Eric Liang | 2016-05-25 | 1 | -11/+24
  ## What changes were proposed in this pull request?
  This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
  ## How was this patch tested?
  Python doc tests.
  cc andrewor14
  Author: Eric Liang <ekl@databricks.com>
  Closes #13289 from ericl/spark-15520.
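  A minimal sketch of the behavior this enables (illustrative; the conf key is hypothetical):
  ```python
  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .config("spark.some.app.conf", "value")
           .getOrCreate())

  # The conf is applied even when getOrCreate() returns an existing session
  print(spark.conf.get("spark.some.app.conf"))
  ```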
* [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions | Holden Karau | 2016-05-24 | 1 | -13/+17
  ## What changes were proposed in this pull request?
  PySpark: Add links to the predictors from the models in regression.py, and improve the linear and isotonic pydoc in minor ways.
  User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu, and add sudo to match the rest of the commands.
  User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu, but maybe more).
  ## How was this patch tested?
  Built pydocs locally; tested the new user build instructions.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
* [SPARK-15433][PYSPARK] PySpark core test should not use SerDe from PythonMLLibAPI | Liang-Chi Hsieh | 2016-05-24 | 1 | -2/+2
  ## What changes were proposed in this pull request?
  Currently the PySpark core test uses the `SerDe` from `PythonMLLibAPI`, which includes many MLlib things. It should use `SerDeUtil` instead.
  ## How was this patch tested?
  Existing tests.
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13214 from viirya/pycore-use-serdeutil.
* [SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work | Liang-Chi Hsieh | 2016-05-24 | 1 | -0/+3
  ## What changes were proposed in this pull request?
  Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't support it at all. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.
  ## How was this patch tested?
  `JsonParsingOptionsSuite`.
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9759 from viirya/fix-json-nonnumric.
* [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer | Nick Pentreath | 2016-05-24 | 1 | -15/+36
  This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets`, where the param was both a class and an instance attribute (I removed the instance attr to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.
  ## How was this patch tested?
  A little doctest, and built the API docs locally to check HTML doc generation.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
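  A minimal sketch of the new param (illustrative; column names are made up), run with an active SparkSession:
  ```python
  from pyspark.ml.feature import QuantileDiscretizer

  # relativeError controls the precision of the underlying approxQuantile call
  qds = QuantileDiscretizer(numBuckets=3, relativeError=0.01,
                            inputCol="hour", outputCol="bucket")
  ```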
* [SPARK-15397][SQL] fix string udf locate as hive | Daoyuan Wang | 2016-05-23 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it means the starting index and starts from 1, so when we pass 0 the return is always 0.
  ## How was this patch tested?
  Tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #13186 from adrian-wang/locate.
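  An illustrative PySpark sketch of the Hive-compatible semantics after this fix (the toy data is made up), assuming a running SparkSession `spark`:
  ```python
  from pyspark.sql.functions import locate

  df = spark.createDataFrame([("aaa",)], ["s"])
  # 1-based start position: pos=0 now returns 0, matching Hive
  df.select(locate("aa", "s", 0), locate("aa", "s", 1), locate("aa", "s", 2)).show()
  ```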
* [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code | WeichenXu | 2016-05-23 | 22 | -232/+298
  ## What changes were proposed in this pull request?
  Replace SQLContext and SparkContext with SparkSession using the builder pattern in the Python test code.
  ## How was this patch tested?
  Existing tests.
  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
* [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions | Dongjoon Hyun | 2016-05-23 | 1 | -0/+3
  ## What changes were proposed in this pull request?
  Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
  ## How was this patch tested?
  It's only about docs.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13087 from dongjoon-hyun/SPARK-15282.
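  A short sketch of why the note matters (illustrative; the UDF is made up):
  ```python
  from pyspark.sql.functions import udf
  from pyspark.sql.types import IntegerType

  # Spark assumes this is deterministic: the optimizer may evaluate it more
  # than once per row or reorder it, so avoid side effects and randomness here.
  plus_one = udf(lambda x: x + 1, IntegerType())
  ```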
* [SPARK-15456][PYSPARK] Fixed PySpark shell context initialization when HiveConf not present | Bryan Cutler | 2016-05-20 | 1 | -2/+2
  ## What changes were proposed in this pull request?
  When the PySpark shell cannot find HiveConf, it will fall back to creating a SparkSession from a SparkContext. This fixes a bug caused by referencing the SparkContext variable before it was initialized.
  ## How was this patch tested?
  Manually starting the PySpark shell and using the SparkContext.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13237 from BryanCutler/pyspark-shell-session-context-SPARK-15456.
* [SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression | Liang-Chi Hsieh | 2016-05-20 | 1 | -6/+5
  ## What changes were proposed in this pull request?
  The default value of the param linkPredictionCol for GeneralizedLinearRegression differed between PySpark and Scala, because of a default value conflict between #13106 and #13129. This caused ml.tests to fail.
  ## How was this patch tested?
  Existing tests.
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13220 from viirya/hotfix-regresstion.
* [SPARK-15417][SQL][PYTHON] PySpark shell always uses in-memory catalog | Andrew Or | 2016-05-19 | 2 | -3/+11
  ## What changes were proposed in this pull request?
  There is no way to use the Hive catalog in `pyspark-shell`. This is because we used to create a `SparkContext` before calling `SparkSession.enableHiveSupport().getOrCreate()`, which just gets the existing `SparkContext` instead of creating a new one. As a result, `spark.sql.catalogImplementation` was never propagated.
  ## How was this patch tested?
  Manual.
  Author: Andrew Or <andrew@databricks.com>
  Closes #13203 from andrewor14/fix-pyspark-shell.
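  A minimal sketch of the ordering constraint described above (illustrative):
  ```python
  from pyspark.sql import SparkSession

  # Hive support must be requested when the session (and its SparkContext) is
  # first created; it cannot be enabled on an already-running context.
  spark = (SparkSession.builder
           .enableHiveSupport()
           .getOrCreate())
  ```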
* [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified | Reynold Xin | 2016-05-19 | 2 | -4/+18
  ## What changes were proposed in this pull request?
  Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that. This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
  ## How was this patch tested?
  Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13200 from rxin/SPARK-15075.
* [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync | Yanbo Liang | 2016-05-19 | 1 | -4/+1
  ## What changes were proposed in this pull request?
  ```ml.evaluation``` Scala and Python API sync.
  ## How was this patch tested?
  Only API docs change, no new tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13195 from yanboliang/evaluation-doc.
* [SPARK-15392][SQL] fix default value of size estimation of logical plan | Davies Liu | 2016-05-19 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimate could fall below autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD. This PR changes the default value to Long.MaxValue.
  ## How was this patch tested?
  Added regression tests.
  Author: Davies Liu <davies@databricks.com>
  Closes #13183 from davies/fix_default_size.
* [SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression | Holden Karau | 2016-05-19 | 1 | -11/+35
  ## What changes were proposed in this pull request?
  Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list.
  ## How was this patch tested?
  Doctests & built docs locally.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
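  A minimal sketch of the new param (illustrative; the family/link choices and column name are made up), run with an active SparkSession:
  ```python
  from pyspark.ml.regression import GeneralizedLinearRegression

  # linkPredictionCol holds the prediction on the link-function scale
  glr = GeneralizedLinearRegression(family="poisson", link="log",
                                    linkPredictionCol="linkPrediction")
  ```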
* [SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog | gatorsmile | 2016-05-19 | 2 | -3/+10
  #### What changes were proposed in this pull request?
  This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385
  The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135
  For example, in PySpark, if we input the following statement:
  ```python
  >>> l = [('Alice', 1)]
  >>> df = sqlContext.createDataFrame(l)
  >>> df.createTempView("people")
  >>> df.createTempView("people")
  ```
  Before this PR, the exception we will get is like
  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
      self._jdf.createTempView(name)
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
      return f(*a, **kw)
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
  : org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
      at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
      at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
      at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:280)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:211)
      at java.lang.Thread.run(Thread.java:745)
  ```
  After this PR, the exception we will get becomes cleaner:
  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
      self._jdf.createTempView(name)
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
    File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
      raise AnalysisException(s.split(': ', 1)[1], stackTrace)
  pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
  ```
  #### How was this patch tested?
  Fixed an existing PySpark test case.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #13126 from gatorsmile/followup-14684.
* [DOC][MINOR] ml.feature Scala and Python API sync | Bryan Cutler | 2016-05-19 | 1 | -13/+26
  ## What changes were proposed in this pull request?
  I reviewed the Scala and Python APIs for ml.feature and corrected discrepancies.
  ## How was this patch tested?
  Built docs locally, ran style checks.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13159 from BryanCutler/ml.feature-api-sync.
* [SPARK-14463][SQL] Document the semantics for read.text | Reynold Xin | 2016-05-18 | 1 | -0/+3
  ## What changes were proposed in this pull request?
  This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13184 from rxin/SPARK-14463.
* [SPARK-14891][ML] Add schema validation for ALS | Nick Pentreath | 2016-05-18 | 1 | -4/+4
  This PR adds schema validation to `ml`'s ALS and ALSModel. Previously, no schema validation was performed, as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to the missing schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.
  With this PR, ALS now supports all numeric types for the `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently); however for user/item, the cast throws an error if the value is outside integer range. Behavior for the rating col is unchanged (as it is not an issue).
  ## How was this patch tested?
  New test cases in `ALSSuite`.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
* [SPARK-15342][SQL][PYSPARK] PySpark test for non ascii column name does not actually test with unicode column name | Liang-Chi Hsieh | 2016-05-18 | 2 | -3/+11
  ## What changes were proposed in this pull request?
  The PySpark SQL test `test_column_name_with_non_ascii` is meant to test a non-ascii column name, but it doesn't actually do so. We need to construct a unicode string explicitly using `unicode` under Python 2.
  ## How was this patch tested?
  Existing tests.
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13134 from viirya/correct-non-ascii-colname-pytest.
* [SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics | Takuya Kuwahara | 2016-05-18 | 2 | -10/+53
  ## What changes were proposed in this pull request?
  This pull request adds validationMetrics support for TrainValidationSplitModel in Python, with a test for it.
  ## How was this patch tested?
  Test in `python/pyspark/ml/tests.py`.
  Author: Takuya Kuwahara <taakuu19@gmail.com>
  Closes #12767 from taku-k/spark-14978.
* [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable | Sean Zhong | 2016-05-18 | 4 | -16/+17
  ## What changes were proposed in this pull request?
  Update the unit test code, examples, and documents to remove calls to the deprecated method `dataset.registerTempTable`.
  ## How was this patch tested?
  This PR only changes the unit test code, examples, and comments, so it should be safe. This is a follow-up of PR https://github.com/apache/spark/pull/12945, which was merged.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #13098 from clockfly/spark-15171-remove-deprecation.
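  The replacement pattern, as a minimal sketch (illustrative; `df` is any DataFrame and `spark` a running SparkSession):
  ```python
  # registerTempTable is deprecated in 2.0; createOrReplaceTempView replaces it
  df.createOrReplaceTempView("people")
  spark.sql("SELECT * FROM people").show()
  ```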