path: root/python
Commit message | Author | Age | Files | Lines
...
* [SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent.    (Dongjoon Hyun, 2016-05-17; 2 files, -0/+9)
  ## What changes were proposed in this pull request?
  **createDataFrame** returns inconsistent types for column names.
  ```python
  >>> from pyspark.sql.types import StructType, StructField, StringType
  >>> schema = StructType([StructField(u"col", StringType())])
  >>> df1 = spark.createDataFrame([("a",)], schema)
  >>> df1.columns  # "col" is str
  ['col']
  >>> df2 = spark.createDataFrame([("a",)], [u"col"])
  >>> df2.columns  # "col" is unicode
  [u'col']
  ```
  The reason is that only **StructField** has the following code.
  ```
  if not isinstance(name, str):
      name = name.encode('utf-8')
  ```
  This PR adds the same logic into **createDataFrame** for consistency.
  ```
  if isinstance(schema, list):
      schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
  ```
  ## How was this patch tested?
  Pass the Jenkins test (with new python doctest).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #13097 from dongjoon-hyun/SPARK-15244.
* [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms    (DB Tsai, 2016-05-17; 8 files, -112/+94)
  ## What changes were proposed in this pull request?
  Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
  ## How was this patch tested?
  Unit tests
  Author: DB Tsai <dbt@netflix.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12627 from dbtsai/SPARK-14615-NewML.
* [SPARK-14906][ML] Copy linalg in PySpark to new ML package    (Xiangrui Meng, 2016-05-17; 3 files, -45/+1564)
  ## What changes were proposed in this pull request?
  Copy the linalg types (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to the new ML package.
  ## How was this patch tested?
  Existing tests.
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
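  A minimal sketch of what the copied package exposes on the Python side (assuming the new `pyspark.ml.linalg` module mirrors the factory methods of `pyspark.mllib.linalg`; the values are illustrative only):
  ```python
  # Sketch only: vector/matrix construction from the new ML linalg package.
  from pyspark.ml.linalg import Vectors, Matrices

  dense = Vectors.dense([1.0, 0.0, 3.0])            # DenseVector
  sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])    # size, indices, values
  mat = Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0])  # column-major 2x2 matrix

  print(dense.dot(sparse))   # 10.0
  print(mat.toArray())
  ```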
* [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1    (Holden Karau, 2016-05-13; 3 files, -1/+1)
  ## What changes were proposed in this pull request?
  This upgrades to Py4J 0.10.1, which reduces syscall overhead in the Java gateway (see https://github.com/bartdag/py4j/issues/201). Related: https://issues.apache.org/jira/browse/SPARK-6728.
  ## How was this patch tested?
  Existing doctests & unit tests pass
  Author: Holden Karau <holden@us.ibm.com>
  Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
* [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.    (sethah, 2016-05-13; 2 files, -2/+238)
  ## What changes were proposed in this pull request?
  This patch adds a Python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.
  ## How was this patch tested?
  Added a unit test to `pyspark.ml.tests`
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #12961 from sethah/GLR_summary.
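  A rough sketch of how the new summary API can be used (a sketch only, assuming an existing SparkSession `spark`; the attribute and method names `summary`, `deviance`, `evaluate`, and `residuals` are assumed to mirror the Scala API, and the vector import reflects the post-migration `pyspark.ml.linalg` package):
  ```python
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.regression import GeneralizedLinearRegression

  df = spark.createDataFrame(
      [(1.0, Vectors.dense(0.0, 5.0)),
       (0.0, Vectors.dense(1.0, 2.0)),
       (2.0, Vectors.dense(2.0, 1.0))],
      ["label", "features"])

  glr = GeneralizedLinearRegression(family="gaussian", link="identity")
  model = glr.fit(df)

  training = model.summary        # training summary
  print(training.deviance)
  test = model.evaluate(df)       # summary computed on another DataFrame
  test.residuals().show()
  ```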
* [MINOR][PYSPARK] update _shared_params_code_gen.py    (Zheng RuiFeng, 2016-05-13; 3 files, -10/+10)
  ## What changes were proposed in this pull request?
  1. Add argument checks for `tol` and `stepSize` to keep in line with `SharedParamsCodeGen.scala`.
  2. Fix one typo.
  ## How was this patch tested?
  Local build
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12996 from zhengruifeng/py_args_checking.
* [SPARK-15188] Add missing thresholds param to NaiveBayes in PySpark    (Holden Karau, 2016-05-13; 1 file, -5/+10)
  ## What changes were proposed in this pull request?
  Add the missing thresholds param to NaiveBayes.
  ## How was this patch tested?
  doctests
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.
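  A small sketch of the newly exposed param (values here are arbitrary; `thresholds` takes one entry per class):
  ```python
  from pyspark.ml.classification import NaiveBayes

  # Construct with thresholds, or set them afterwards; the predicted class is
  # the one with the largest value of p(class) / threshold(class).
  nb = NaiveBayes(smoothing=1.0, thresholds=[0.2, 0.8])
  print(nb.getThresholds())      # [0.2, 0.8]
  nb.setThresholds([0.5, 0.5])
  ```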
* [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView    (Sean Zhong, 2016-05-12; 4 files, -28/+59)
  ## What changes were proposed in this pull request?
  Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.
  ## How was this patch tested?
  Unit tests.
  Author: Sean Zhong <seanzhong@databricks.com>
  Closes #12945 from clockfly/spark-15171.
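  A sketch of the new API next to the deprecated one (assumes a SparkSession named `spark`; names are illustrative):
  ```python
  df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

  # New API: createOrReplaceTempView never fails; createTempView raises an
  # error if the view name is already in use.
  df.createOrReplaceTempView("people")
  spark.sql("SELECT value FROM people WHERE id = 1").show()

  # Deprecated, still works but is slated for removal:
  df.registerTempTable("people_old")
  ```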
* [SPARK-15281][PYSPARK][ML][TRIVIAL] Add impurity param to GBTRegressor & add experimental inside of regression.py    (Holden Karau, 2016-05-12; 1 file, -8/+44)
  ## What changes were proposed in this pull request?
  Add the impurity param to GBTRegressor and mark the models & regressors in regression.py as experimental to match the Scaladoc.
  ## How was this patch tested?
  Added default value to init, tested with unit/doc tests.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #13071 from holdenk/SPARK-15281-GBTRegressor-impurity.
* [SPARK-15072][SQL][PYSPARK][HOT-FIX] Remove SparkSession.withHiveSupport from readwrite.py    (Yin Huai, 2016-05-11; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  It seems https://github.com/apache/spark/commit/db573fc743d12446dd0421fb45d00c2f541eaf9a did not remove withHiveSupport from readwrite.py.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #13069 from yhuai/fixPython.
* [SPARK-15072][SQL][PYSPARK] FollowUp: Remove SparkSession.withHiveSupport in PySpark    (Sandeep Singh, 2016-05-11; 2 files, -11/+3)
  ## What changes were proposed in this pull request?
  This is a followup of https://github.com/apache/spark/pull/12851. Remove `SparkSession.withHiveSupport` in PySpark and instead use `SparkSession.builder.enableHiveSupport`.
  ## How was this patch tested?
  Existing tests.
  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13063 from techaddict/SPARK-15072-followup.
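  A minimal sketch of the replacement pattern (app name is a placeholder):
  ```python
  from pyspark.sql import SparkSession

  # Instead of the removed SparkSession.withHiveSupport(sc):
  spark = (SparkSession.builder
           .appName("hive-example")
           .enableHiveSupport()
           .getOrCreate())
  spark.sql("SHOW TABLES").show()
  ```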
* [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names    (Bill Chambers, 2016-05-11; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  When a CSV begins with:
  - `,,` OR
  - `"","",`
  meaning that the first column names are either empty or blank strings and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that given column. For example, if you were to read in the CSV:
  ```
  "","second column"
  "hello", "there"
  ```
  then column names would become `"C0", "second column"`. This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
  ### Current Behavior in Spark <=1.6
  In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However the CSV reads in without issue.
  ### Current Behavior in Spark 2.0
  Spark throws a NullPointerError and will not read in the file.
  #### Reproduction in 2.0
  https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html
  ## How was this patch tested?
  A new test was added to `CSVSuite` to account for this issue. We then have asserts that test for being able to select both the empty column names as well as the regular column names.
  Author: Bill Chambers <bill@databricks.com>
  Author: Bill Chambers <wchambers@ischool.berkeley.edu>
  Closes #13041 from anabranch/master.
* [SPARK-15256] [SQL] [PySpark] Clarify DataFrameReader.jdbc() docstring    (Nicholas Chammas, 2016-05-11; 1 file, -32/+35)
  This PR:
  * Corrects the documentation for the `properties` parameter, which is supposed to be a dictionary and not a list.
  * Generally clarifies the Python docstring for DataFrameReader.jdbc() by pulling from the [Scala docstrings](https://github.com/apache/spark/blob/b28137764716f56fa1a923c4278624a56364a505/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L201-L251) and rephrasing things.
  * Corrects minor Sphinx typos.
  Author: Nicholas Chammas <nicholas.chammas@gmail.com>
  Closes #13034 from nchammas/SPARK-15256.
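  A hedged example of the call shape the corrected docstring describes; the URL, table name, and credentials below are placeholders:
  ```python
  url = "jdbc:postgresql://db-host:5432/mydb"
  props = {"user": "spark", "password": "secret",
           "driver": "org.postgresql.Driver"}

  # `properties` is a dict of JDBC connection arguments, not a list.
  df = spark.read.jdbc(url, table="public.accounts", properties=props)

  # Optional partitioned read on a numeric column:
  df_part = spark.read.jdbc(
      url, table="public.accounts",
      column="id", lowerBound=1, upperBound=100000, numPartitions=8,
      properties=props)
  ```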
* [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame    (Reynold Xin, 2016-05-11; 2 files, -4/+2)
  ## What changes were proposed in this pull request?
  Earlier we removed the experimental tag for Scala/Java DataFrames, but haven't done so for Python. This patch removes the experimental flag for Python and declares them stable.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13062 from rxin/SPARK-15278.
* [SPARK-15270] [SQL] Use SparkSession Builder to build a session with HiveSupport    (Sandeep Singh, 2016-05-11; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  Before: creating a HiveContext was failing
  ```python
  from pyspark.sql import HiveContext
  hc = HiveContext(sc)
  ```
  with
  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "spark-2.0/python/pyspark/sql/context.py", line 458, in __init__
      sparkSession = SparkSession.withHiveSupport(sparkContext)
    File "spark-2.0/python/pyspark/sql/session.py", line 192, in withHiveSupport
      jsparkSession = sparkContext._jvm.SparkSession.withHiveSupport(sparkContext._jsc.sc())
    File "spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 1048, in __getattr__
  py4j.protocol.Py4JError: org.apache.spark.sql.SparkSession.withHiveSupport does not exist in the JVM
  ```
  Now:
  ```python
  >>> from pyspark.sql import HiveContext
  >>> hc = HiveContext(sc)
  >>> hc.range(0, 100)
  DataFrame[id: bigint]
  >>> hc.range(0, 100).count()
  100
  ```
  ## How was this patch tested?
  Existing tests, tested manually in python shell
  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13056 from techaddict/SPARK-15270.
* [SPARK-12200][SQL] Add __contains__ implementation to Row    (Maciej Brynski, 2016-05-11; 1 file, -1/+21)
  https://issues.apache.org/jira/browse/SPARK-12200
  Author: Maciej Brynski <maciej.brynski@adpilot.pl>
  Author: Maciej Bryński <maciek-github@brynski.pl>
  Closes #10194 from maver1ck/master.
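  With this change the `in` operator checks Row field names, e.g. (a short sketch):
  ```python
  from pyspark.sql import Row

  row = Row(name="Alice", age=11)
  print("name" in row)     # True
  print("height" in row)   # False
  ```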
* [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact    (cody koeninger, 2016-05-11; 2 files, -6/+6)
  ## What changes were proposed in this pull request?
  Rename the streaming-kafka artifact to include the Kafka version, in anticipation of needing a different artifact for later Kafka versions.
  ## How was this patch tested?
  Unit tests
  Author: cody koeninger <cody@koeninger.org>
  Closes #12946 from koeninger/SPARK-15085.
* [SPARK-15037] [SQL] [MLLIB] Part2: Use SparkSession instead of SQLContext in Python TestSuites    (Sandeep Singh, 2016-05-11; 4 files, -294/+273)
  ## What changes were proposed in this pull request?
  Use SparkSession instead of SQLContext in Python TestSuites.
  ## How was this patch tested?
  Existing tests
  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13044 from techaddict/SPARK-15037-python.
* [SPARK-15189][PYSPARK][DOCS] Update ml.evaluation PyDoc    (Holden Karau, 2016-05-11; 1 file, -1/+12)
  ## What changes were proposed in this pull request?
  Fix a doctest issue and short param descriptions, and tag items as Experimental.
  ## How was this patch tested?
  Built docs locally & ran doctests.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12964 from holdenk/SPARK-15189-ml.Evaluation-PyDoc-issues.
* [SPARK-15250][SQL] Remove deprecated json API in DataFrameReader    (hyukjinkwon, 2016-05-10; 1 file, -2/+2)
  ## What changes were proposed in this pull request?
  This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.
  ## How was this patch tested?
  Jenkins tests (existing tests should cover this)
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Hyukjin Kwon <gurwls223@gmail.com>
  Closes #13040 from HyukjinKwon/SPARK-15250.
* [SPARK-15261][SQL] Remove experimental tag from DataFrameReader/Writer    (Reynold Xin, 2016-05-10; 1 file, -5/+9)
  ## What changes were proposed in this pull request?
  This patch removes the experimental tag from DataFrameReader and DataFrameWriter, and explicitly tags a few methods added for structured streaming as experimental.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #13038 from rxin/SPARK-15261.
* [SPARK-14936][BUILD][TESTS] FlumePollingStreamSuite is slow    (Xin Ren, 2016-05-10; 1 file, -1/+1)
  https://issues.apache.org/jira/browse/SPARK-14936
  ## What changes were proposed in this pull request?
  FlumePollingStreamSuite contains two tests which run for a minute each. This seems excessively slow and we should speed it up if possible. In this PR, instead of creating a `StreamingContext` directly from `conf`, an underlying `SparkContext` is created once up front and used to create each `StreamingContext`. Running time is reduced by avoiding repeated `SparkContext` creation and teardown.
  ## How was this patch tested?
  Tested on my local machine running `testOnly *.FlumePollingStreamSuite`
  Author: Xin Ren <iamshrek@126.com>
  Closes #12845 from keypointt/SPARK-14936.
* [SPARK-15195][PYSPARK][DOCS] Update ml.tuning PyDocs    (Holden Karau, 2016-05-10; 1 file, -1/+15)
  ## What changes were proposed in this pull request?
  Tag classes in ml.tuning as experimental, add docs for the k-folds average metric, and copy the TrainValidationSplit scaladoc for a more detailed explanation.
  ## How was this patch tested?
  Built docs locally
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12967 from holdenk/SPARK-15195-pydoc-ml-tuning.
* [SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog    (gatorsmile, 2016-05-10; 1 file, -0/+2)
  Since we cannot really trust the underlying external catalog to throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog.
  - [X] The first step is to unify the error messages issued in the Hive-specific Session Catalog and the general Session Catalog.
  - [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801
  - [X] The third step is to add database existence verification in `SessionCatalog`
  - [X] The fourth step is to add table existence verification in `SessionCatalog`
  - [X] The fifth step is to add function existence verification in `SessionCatalog`
  Add test cases and verify the error messages we issued.
  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #12385 from gatorsmile/verifySessionAPIs.
* [SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default param doc note    (Holden Karau, 2016-05-09; 5 files, -25/+40)
  ## What changes were proposed in this pull request?
  PyDoc links in ml are in a non-standard format. Switch to the standard Sphinx link format for better formatted documentation. Also add a note about a default value in one place, and copy some extended docs from Scala for GBT.
  ## How was this patch tested?
  Built docs locally.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
* [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover    (Burak Köse, 2016-05-06; 2 files, -16/+29)
  ## What changes were proposed in this pull request?
  This PR continues the work from #11871 with the following changes:
  * load English stopwords as default
  * convert stopwords to a list in Python
  * update some tests and docs
  ## How was this patch tested?
  Unit tests.
  Closes #11871
  cc: burakkose srowen
  Author: Burak Köse <burakks41@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Burak KOSE <burakks41@gmail.com>
  Closes #12843 from mengxr/SPARK-14050.
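  A sketch of the multi-language support from the Python side (assuming `loadDefaultStopWords` is exposed as shown and an existing SparkSession `spark`; the sample data is illustrative):
  ```python
  from pyspark.ml.feature import StopWordsRemover

  french = StopWordsRemover.loadDefaultStopWords("french")
  remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                             stopWords=french)

  df = spark.createDataFrame([(["je", "mange", "une", "pomme"],)], ["words"])
  remover.transform(df).show(truncate=False)
  ```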
* [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA"    (Holden Karau, 2016-05-05; 1 file, -0/+4)
  ## What changes were proposed in this pull request?
  Copy the package documentation from Scala/Java to Python for the ML package and remove the beta tags. Not super sure if we want to keep the BETA tag, but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).
  ## How was this patch tested?
  Python documentation built locally as HTML and text and verified output.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
* [MINOR] remove dead code    (Davies Liu, 2016-05-04; 1 file, -9/+0)
* [SPARK-14896][SQL] Deprecate HiveContext in python    (Andrew Or, 2016-05-04; 3 files, -4/+9)
  ## What changes were proposed in this pull request?
  See title.
  ## How was this patch tested?
  PySpark tests.
  Author: Andrew Or <andrew@databricks.com>
  Closes #12917 from andrewor14/deprecate-hive-context-python.
* [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example.    (Dongjoon Hyun, 2016-05-04; 1 file, -0/+5)
  ## What changes were proposed in this pull request?
  This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`.
  - Use the **SparkSession Builder Pattern** in 154 files (Scala 55, Java 52, Python 47).
  - Add `getConf` to the Python SparkContext class: `python/pyspark/context.py`
  - Replace the **SQLContext Singleton Pattern** with the **SparkSession Singleton Pattern**:
    - `SqlNetworkWordCount.scala`
    - `JavaSqlNetworkWordCount.java`
    - `sql_network_wordcount.py`
  Now, `SQLContext` is used only in R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail with some unknown issue.
  - `simple_params_example.py`
  - `aft_survival_regression.py`
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12809 from dongjoon-hyun/SPARK-15031.
* [SPARK-15126][SQL] RuntimeConfig.set should return Unit    (Reynold Xin, 2016-05-04; 2 files, -4/+0)
  ## What changes were proposed in this pull request?
  Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, the Scala REPL) weird, because it'd show the response of calling set as a RuntimeConfig itself.
  ## How was this patch tested?
  Updated unit tests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #12902 from rxin/SPARK-15126.
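  The practical effect on the Python side, sketched (assuming a SparkSession `spark`; the config keys are illustrative):
  ```python
  # set() now returns None, so calls can no longer be chained.
  result = spark.conf.set("spark.sql.shuffle.partitions", "10")
  print(result)                                           # None
  print(spark.conf.get("spark.sql.shuffle.partitions"))   # '10'

  # Chaining such as spark.conf.set("a", "1").set("b", "2") no longer works;
  # issue separate set() calls instead.
  ```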
* [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark.    (Dongjoon Hyun, 2016-05-03; 1 file, -1/+90)
  ## What changes were proposed in this pull request?
  This is a Python port of the corresponding Scala builder pattern code. `sql.py` is modified as a target example case.
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12860 from dongjoon-hyun/SPARK-15084.
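  A minimal sketch of the ported builder pattern (the master, app name, and config values are illustrative):
  ```python
  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .master("local[2]")
           .appName("builder-example")
           .config("spark.some.config.option", "some-value")
           .getOrCreate())

  print(spark.range(0, 10).count())   # 10
  ```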
* [SPARK-9819][STREAMING][DOCUMENTATION] Clarify doc for invReduceFunc in incremental versions of reduceByWindow    (François Garillot, 2016-05-03; 1 file, -1/+3)
  Clarifies:
  - that reduceFunc and invReduceFunc should be associative
  - that the intermediate result in iterated applications of inverseReduceFunc is its first argument
  Author: François Garillot <francois@garillot.net>
  Closes #8103 from huitseeker/issue/invReduceFuncDoc.
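  A sketch of the incremental form the clarified doc refers to: `add`/`sub` are associative, and `sub` removes elements leaving the window (the host, port, and durations are placeholders; assumes an existing SparkContext `sc`):
  ```python
  from operator import add, sub
  from pyspark.streaming import StreamingContext

  ssc = StreamingContext(sc, 2)
  ssc.checkpoint("/tmp/checkpoint")   # required for the inverse-function form

  counts = (ssc.socketTextStream("localhost", 9999)
               .map(lambda line: 1)
               .reduceByWindow(add, sub, windowDuration=30, slideDuration=10))
  counts.pprint()
  ```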
* [SPARK-14716][SQL] Added support for partitioning in FileStreamSink    (Tathagata Das, 2016-05-03; 1 file, -2/+2)
  # What changes were proposed in this pull request?
  Support partitioning in the file stream sink. This is implemented using a new, but simpler, code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them. This patch duplicates < 100 LOC from the WriterContainer, but it's far simpler than WriterContainer as it does not involve output committing. In addition, it introduces new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in `OutputWriterFactory.newInstance()`).
  # Tests
  - New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files
  - New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires a change in the StreamFileCatalog; future PR)
  - Updated FileStressSuite to test the number of records read from partitioned output files
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #12409 from tdas/streaming-partitioned-parquet.
* [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up    (Yanbo Liang, 2016-05-03; 10 files, -219/+110)
  ## What changes were proposed in this pull request?
  PySpark ML Params setter code clean up. For example, ```setInputCol``` can be simplified from
  ```
  self._set(inputCol=value)
  return self
  ```
  to:
  ```
  return self._set(inputCol=value)
  ```
  This is a pretty big sweep, and we cleaned up wherever possible.
  ## How was this patch tested?
  Existing unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12749 from yanboliang/spark-14971.
* [SPARK-15050][SQL] Put CSV and JSON options as Python csv and json function parameters    (hyukjinkwon, 2016-05-02; 1 file, -77/+155)
  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-15050
  This PR adds function parameters for the Python API for reading and writing `csv()`.
  ## How was this patch tested?
  This was tested by `./dev/run_tests`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Hyukjin Kwon <gurwls223@gmail.com>
  Closes #12834 from HyukjinKwon/SPARK-15050.
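  A sketch of the resulting call style, with the usual `.option()` calls replaced by keyword arguments (the paths and option values are placeholders, and the option names are assumed to match the documented CSV options):
  ```python
  df = spark.read.csv("/tmp/people.csv",
                      header=True, inferSchema=True, sep=",", nullValue="NA")

  df.write.csv("/tmp/people_out", mode="overwrite", header=True, quote='"')
  ```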
* [SPARK-13425][SQL] Documentation for CSV datasource options    (hyukjinkwon, 2016-05-01; 1 file, -0/+52)
  ## What changes were proposed in this pull request?
  This PR adds the explanation and documentation for CSV options for reading and writing.
  ## How was this patch tested?
  Style tests with `./dev/run_tests` for documentation style.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Author: Hyukjin Kwon <gurwls223@gmail.com>
  Closes #12817 from HyukjinKwon/SPARK-13425.
* [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update    (Xusen Yin, 2016-05-01; 5 files, -12/+62)
  ## What changes were proposed in this pull request?
  This PR is an update for [https://github.com/apache/spark/pull/12738] which:
  * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
  * Various fixes for bugs found
    * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
  Defaults changed:
  * Scala
    * LogisticRegression: weightCol defaults to not set (instead of empty string)
    * StringIndexer: labels default to not set (instead of empty array)
    * GeneralizedLinearRegression:
      * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
      * weightCol defaults to not set (instead of empty string)
    * LinearRegression: weightCol defaults to not set (instead of empty string)
  * Python
    * MultilayerPerceptron: layers default to not set (instead of [1,1])
    * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
  ## How was this patch tested?
  Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: yinxusen <yinxusen@gmail.com>
  Closes #12816 from jkbradley/yinxusen-SPARK-14931.
* [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0    (Herman van Hovell, 2016-04-30; 2 files, -19/+0)
  #### What changes were proposed in this pull request?
  This PR removes three methods that were deprecated in 1.6.0:
  - `PortableDataStream.close()`
  - `LinearRegression.weights`
  - `LogisticRegression.weights`
  The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.
  #### How was this patch tested?
  Compilation succeeded.
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #12732 from hvanhovell/SPARK-14952.
* [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel    (Junyang, 2016-04-30; 1 file, -7/+8)
  ## What changes were proposed in this pull request?
  This PR fixes the bug that generates infinite distances between word vectors. For example, before this PR,
  ```
  val synonyms = model.findSynonyms("who", 40)
  ```
  will give the following results:
  ```
  to Infinity
  and Infinity
  that Infinity
  with Infinity
  ```
  With this PR, the distance between words is a value between 0 and 1, as follows:
  ```
  scala> model.findSynonyms("who", 10)
  res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))
  scala> model.findSynonyms("king", 10)
  res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))
  scala> model.findSynonyms("queen", 10)
  res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
  ```
  ### There are two places changed in this PR:
  - Normalize the word vectors to avoid overflow when calculating the inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
  - Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation.
  ## How was this patch tested?
  Use word2vec to train a text corpus, and run model.findSynonyms() to get the distances between word vectors.
  Author: Junyang <fly.shenjy@gmail.com>
  Author: flyskyfly <fly.shenjy@gmail.com>
  Closes #11812 from flyjy/TVec.
* [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS    (Xiangrui Meng, 2016-04-30; 2 files, -40/+40)
  ## What changes were proposed in this pull request?
  As discussed in #12660, this PR renames
  * intermediateRDDStorageLevel -> intermediateStorageLevel
  * finalRDDStorageLevel -> finalStorageLevel
  The argument name in `ALS.train` will be addressed in SPARK-15027.
  ## How was this patch tested?
  Existing unit tests.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12803 from mengxr/SPARK-14412.
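  A sketch of using the renamed params from Python (column names and storage-level strings are illustrative; see the original param-adding change just below):
  ```python
  from pyspark.ml.recommendation import ALS

  als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
            intermediateStorageLevel="MEMORY_ONLY",
            finalStorageLevel="MEMORY_AND_DISK")
  print(als.getIntermediateStorageLevel())   # 'MEMORY_ONLY'
  ```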
* [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS    (Nick Pentreath, 2016-04-29; 2 files, -5/+80)
  `mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.
  ## How was this patch tested?
  New test cases in `ALSSuite` and `tests.py`.
  cc yanboliang jkbradley sethah rishabhbhardwaj
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #12660 from MLnick/SPARK-14412-als-storage-params.
* [SPARK-13786][ML][PYTHON] Removed save/load for python tuning    (Joseph K. Bradley, 2016-04-29; 2 files, -262/+21)
  ## What changes were proposed in this pull request?
  Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily. This support should be re-designed and added in the next release.
  ## How was this patch tested?
  Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12782 from jkbradley/remove-python-tuning-saveload.
* [SPARK-15012][SQL] Simplify configuration API further    (Andrew Or, 2016-04-29; 3 files, -33/+4)
  ## What changes were proposed in this pull request?
  1. Remove all the `spark.setConf` etc. Just expose `spark.conf`.
  2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused.
  This was done for both the Python and Scala APIs.
  ## How was this patch tested?
  `SQLConfSuite`, python tests. This one fixes the failed tests in #12787.
  Closes #12787
  Author: Andrew Or <andrew@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #12798 from yhuai/conf-api.
* [SPARK-14988][PYTHON] SparkSession API follow-ups    (Andrew Or, 2016-04-29; 5 files, -209/+228)
  ## What changes were proposed in this pull request?
  Addresses comments in #12765.
  ## How was this patch tested?
  Python tests.
  Author: Andrew Or <andrew@databricks.com>
  Closes #12784 from andrewor14/python-followup.
* [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2    (Jeff Zhang, 2016-04-29; 2 files, -2/+543)
  ## What changes were proposed in this pull request?
  pyspark.ml API for LDA:
  * LDA, LDAModel, LocalLDAModel, DistributedLDAModel
  * includes persistence
  This replaces [https://github.com/apache/spark/pull/10242].
  ## How was this patch tested?
  * doc test for LDA, including Param setters
  * unit test for persistence
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #12723 from jkbradley/zjffdu-SPARK-11940.
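  A small sketch of the added API (assumes a SparkSession `spark`; the data, parameter values, and vector import are illustrative):
  ```python
  from pyspark.ml.clustering import LDA
  from pyspark.ml.linalg import Vectors

  df = spark.createDataFrame(
      [(0, Vectors.dense([1.0, 0.0, 2.0])),
       (1, Vectors.dense([0.0, 3.0, 1.0]))],
      ["id", "features"])

  lda = LDA(k=2, maxIter=10, optimizer="online")
  model = lda.fit(df)                  # LocalLDAModel for the online optimizer
  print(model.isDistributed())         # False
  model.describeTopics(3).show()
  print(model.logPerplexity(df))
  ```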
* [SPARK-14988][PYTHON] SparkSession catalog and conf API    (Andrew Or, 2016-04-29; 4 files, -85/+605)
  ## What changes were proposed in this pull request?
  The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the Python API.
  ## How was this patch tested?
  Python tests.
  Author: Andrew Or <andrew@databricks.com>
  Closes #12765 from andrewor14/python-spark-session-more.
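  A short sketch of the two new accessors (method and field names are assumed to follow the Scala Catalog/RuntimeConfig API; the config key is illustrative):
  ```python
  spark.conf.set("spark.sql.shuffle.partitions", "4")
  print(spark.conf.get("spark.sql.shuffle.partitions"))

  print(spark.catalog.listDatabases())
  for table in spark.catalog.listTables("default"):
      print(table.name, table.tableType, table.isTemporary)
  ```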
* [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD    (Zheng RuiFeng, 2016-04-28; 2 files, -0/+25)
  ## What changes were proposed in this pull request?
  According to [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate the SGD-based APIs of LogisticRegression and LinearRegression.
  ## How was this patch tested?
  Manual tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12596 from zhengruifeng/deprecate_sgd.
* [SPARK-14555] Second cut of Python API for Structured Streaming    (Burak Yavuz, 2016-04-28; 5 files, -46/+217)
  ## What changes were proposed in this pull request?
  This PR adds Python APIs for:
  - `ContinuousQueryManager`
  - `ContinuousQueryException`
  The `ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`. For `ContinuousQueryManager`, all APIs are provided except for registering listeners. This PR also attempts to fix test flakiness by stopping all active streams just before tests.
  ## How was this patch tested?
  Python doc tests and unit tests
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #12673 from brkyvz/pyspark-cqm.
* [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics    (Kai Jiang, 2016-04-28; 2 files, -8/+37)
  ## What changes were proposed in this pull request?
  Support avgMetrics in CrossValidatorModel with Python.
  ## How was this patch tested?
  Doctest and `test_save_load` in `pyspark/ml/test.py`. [JIRA](https://issues.apache.org/jira/browse/SPARK-12810)
  Author: Kai Jiang <jiangkai@gmail.com>
  Closes #12464 from vectorijk/spark-12810.
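  A sketch of reading the metrics back from Python (`train_df` is a hypothetical DataFrame with `label`/`features` columns; the grid and evaluator are illustrative):
  ```python
  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.evaluation import BinaryClassificationEvaluator
  from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

  lr = LogisticRegression()
  grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
  cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                      evaluator=BinaryClassificationEvaluator(), numFolds=3)

  cv_model = cv.fit(train_df)
  print(cv_model.avgMetrics)    # one averaged metric per param map in `grid`
  best = cv_model.bestModel
  ```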