path: root/python
Commit message | Author | Age | Files | Lines
...
* [SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree | Yu ISHIKAWA | 2015-09-17 | 1 | -1/+35
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8685 from yu-iskw/SPARK-10278.
* [SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering | Yu ISHIKAWA | 2015-09-17 | 1 | -0/+13
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8691 from yu-iskw/SPARK-10281.
* [SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression | Yu ISHIKAWA | 2015-09-17 | 1 | -0/+65
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8693 from yu-iskw/SPARK-10283.
* [SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning | Yu ISHIKAWA | 2015-09-17 | 1 | -0/+28
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8694 from yu-iskw/SPARK-10284.
* [SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.recommendation | Yu ISHIKAWA | 2015-09-16 | 1 | -1/+35
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8677 from yu-iskw/SPARK-10276.
* [SPARK-10516] [MLLIB] Added values property in DenseVector | Vinod K C | 2015-09-15 | 1 | -0/+4
  Author: Vinod K C <vinod.kc@huawei.com>
  Closes #8682 from vinodkc/fix_SPARK-10516.
* [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random | noelsmith | 2015-09-15 | 1 | -1/+1
  Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
  Author: noelsmith <mail@noelsmith.com>
  Closes #8773 from noel-smith/mllib-random-versionadded-fix.
* [SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random | Yu ISHIKAWA | 2015-09-14 | 1 | -0/+15
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8666 from yu-iskw/SPARK-10275.
* [SPARK-10273] Add @since annotation to pyspark.mllib.feature | noelsmith | 2015-09-14 | 1 | -1/+57
  Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
  Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
  Author: noelsmith <mail@noelsmith.com>
  Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
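A decorator of the kind SPARK-10273 describes can be sketched in plain Python. This is a minimal illustration rather than the code from the commit: it assumes the decorator appends a Sphinx `versionadded` directive to the docstring, and the functions `documented` and `undocumented` exist only for the example.

```python
import re


def since(version):
    """Return a decorator that appends a Sphinx versionadded note to a docstring."""
    indent_pattern = re.compile(r"\n( +)")

    def decorator(func):
        doc = func.__doc__ or ""  # tolerate functions without a docstring
        indents = indent_pattern.findall(doc)
        # Match the docstring's own indentation so Sphinx parses the directive.
        indent = " " * (min(len(m) for m in indents) if indents else 0)
        note = "%s.. versionadded:: %s" % (indent, version)
        func.__doc__ = (doc.rstrip() + "\n\n" + note) if doc.strip() else note
        return func

    return decorator


@since("1.5.0")
def documented():
    """Do something useful.

    More detail here.
    """


@since("1.5.0")
def undocumented():
    pass
```

Falling back to an empty string when `__doc__` is `None` is what lets the same decorator be applied to functions that have no docstring.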
* [SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly | Yanbo Liang | 2015-09-14 | 2 | -15/+107
  The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, so that a DenseVector can be compared with a SparseVector.
  Implement the PySpark DenseVector and SparseVector ```__hash__``` methods based on the first 16 entries. This allows PySpark Vector objects to be used in collections.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8166 from yanboliang/spark-9793.
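The idea behind SPARK-9793 can be sketched with two toy classes. Everything here is hypothetical (`_Vector`, `Dense`, `Sparse` are not the `pyspark.mllib.linalg` classes), and the sketch assumes that "the first 16 entries" can be approximated by the first 16 non-zero entries; the point is only that equality is defined on content and that equal vectors hash alike.

```python
class _Vector:
    """Hypothetical base class; not the pyspark.mllib.linalg implementation."""

    HASHED_ENTRIES = 16  # only a bounded prefix of the non-zeros feeds the hash

    def nonzeros(self):
        raise NotImplementedError

    def __eq__(self, other):
        if not isinstance(other, _Vector):
            return NotImplemented
        # Semantic equality: same length and the same non-zero entries.
        return len(self) == len(other) and self.nonzeros() == other.nonzeros()

    def __hash__(self):
        # Built from the same data as __eq__, so equal vectors hash equally.
        return hash((len(self), tuple(self.nonzeros()[:self.HASHED_ENTRIES])))


class Dense(_Vector):
    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __len__(self):
        return len(self.values)

    def nonzeros(self):
        return [(i, v) for i, v in enumerate(self.values) if v != 0.0]


class Sparse(_Vector):
    def __init__(self, size, indices, values):
        self.size = size
        self.entries = sorted(zip(indices, (float(v) for v in values)))

    def __len__(self):
        return self.size

    def nonzeros(self):
        return [(i, v) for i, v in self.entries if v != 0.0]


dense = Dense([1.0, 0.0, 3.0])
sparse = Sparse(3, [0, 2], [1.0, 3.0])
unique = {dense, sparse}  # collapses to one element because they are equal
```

Bounding the number of hashed entries keeps `__hash__` cheap for long vectors, at the cost of more collisions between vectors that differ only further along.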
* [SPARK-10542] [PYSPARK] fix serialize namedtuple | Davies Liu | 2015-09-14 | 3 | -1/+20
  Author: Davies Liu <davies@databricks.com>
  Closes #8707 from davies/fix_namedtuple.
* [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python | Yanbo Liang | 2015-09-14 | 2 | -16/+33
  [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8457 from yanboliang/spark-10194.
* [SPARK-6548] Adding stddev to DataFrame functions | JihongMa | 2015-09-12 | 1 | -18/+18
  Adding STDDEV support for DataFrame using 1-pass online/parallel algorithm to compute variance. Please review the code change.
  Author: JihongMa <linlin200605@gmail.com>
  Author: Jihong MA <linlin200605@gmail.com>
  Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
  Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
  Closes #6297 from JihongMA/SPARK-SQL.
* [SPARK-9014] [SQL] Allow Python spark API to use built-in exponential operator | 0x0FFF | 2015-09-11 | 2 | -1/+14
  This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014).
  Added functionality: the `Column` object in Python now supports the exponential operator `**`.
  Example:
  ```
  from pyspark.sql import *
  df = sqlContext.createDataFrame([Row(a=2)])
  df.select(3**df.a,df.a**3,df.a**df.a).collect()
  ```
  Outputs:
  ```
  [Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
  ```
  Author: 0x0FFF <programmerag@gmail.com>
  Closes #8658 from 0x0FFF/SPARK-9014.
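Supporting `**` on both sides of an object relies on Python's `__pow__` and `__rpow__` hooks. The sketch below shows the mechanism with a hypothetical `Expr` class that only builds a label string; it is not the `Column` implementation, and the float rendering of literals is chosen to resemble the output quoted in SPARK-9014.

```python
class Expr:
    """Hypothetical stand-in for a column expression; it only builds a label."""

    def __init__(self, label):
        self.label = label

    @staticmethod
    def _label(value):
        # Plain numbers are rendered as float literals, other Exprs by their label.
        return value.label if isinstance(value, Expr) else repr(float(value))

    def __pow__(self, other):
        # Handles `expr ** x`.
        return Expr("POWER(%s, %s)" % (self.label, Expr._label(other)))

    def __rpow__(self, other):
        # Handles `x ** expr`, reached when the left operand is a plain number.
        return Expr("POWER(%s, %s)" % (Expr._label(other), self.label))


a = Expr("a")
labels = [(3 ** a).label, (a ** 3).label, (a ** a).label]
```

Without `__rpow__`, the expression `3 ** a` would raise `TypeError`, because `int` does not know how to raise itself to the power of an `Expr`.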
* [PYTHON] Fixed typo in exception message | Icaro Medeiros | 2015-09-11 | 1 | -1/+1
  Just fixing a typo in exception message, raised when attempting to pickle SparkContext.
  Author: Icaro Medeiros <icaro.medeiros@gmail.com>
  Closes #8724 from icaromedeiros/master.
* [SPARK-8530] [ML] add python API for MinMaxScaler | Yuhao Yang | 2015-09-11 | 1 | -5/+99
  jira: https://issues.apache.org/jira/browse/SPARK-8530
  add python API for MinMaxScaler
  jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #7150 from hhbyyh/pythonMinMax.
* [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils | Joseph K. Bradley | 2015-09-11 | 1 | -8/+8
  Changes:
  * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
  * MetadataUtils.scala: “Helper utilities for tree-based algorithms” -> not just trees anymore
  CC: holdenk mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8679 from jkbradley/doc-fixes-1.5.
* [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier | Yanbo Liang | 2015-09-11 | 1 | -1/+131
  Add Python API for ```MultilayerPerceptronClassifier```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8067 from yanboliang/SPARK-9773.
* [SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark | Yanbo Liang | 2015-09-11 | 4 | -96/+143
  LinearRegression and LogisticRegression lack some Params in Python, and some Params are not shared classes, so they have to be written separately for each class. These Params are listed here:
  ```scala
  HasElasticNetParam
  HasFitIntercept
  HasStandardization
  HasThresholds
  ```
  Here we implement them as shared params on the Python side and bring the LinearRegression/LogisticRegression parameters in line with the Scala ones.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8508 from yanboliang/spark-10026.
* [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature | Yanbo Liang | 2015-09-10 | 3 | -8/+59
  Missing methods of ml.feature are listed here:
  ```StringIndexer``` lacks the parameter ```handleInvalid```.
  ```StringIndexerModel``` lacks the method ```labels```.
  ```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8313 from yanboliang/spark-10027.
* [SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getitem__ | Yanbo Liang | 2015-09-10 | 1 | -0/+15
  pyspark.sql.types.Row implements ```__getitem__```
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8333 from yanboliang/spark-7544.
* [SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer | Yanbo Liang | 2015-09-09 | 1 | -5/+90
  Add Python API for ml.feature.VectorSlicer.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8102 from yanboliang/SPARK-9772.
* [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark | Holden Karau | 2015-09-08 | 2 | -5/+72
  Adds IndexToString to PySpark.
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
* [SPARK-10094] Pyspark ML Feature transformers marked as experimental | noelsmith | 2015-09-08 | 1 | -0/+52
  Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
  Author: noelsmith <mail@noelsmith.com>
  Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
* [SPARK-10373] [PYSPARK] move @since into pyspark from sql | Davies Liu | 2015-09-08 | 9 | -25/+23
  cc mengxr
  Author: Davies Liu <davies@databricks.com>
  Closes #8657 from davies/move_since.
* [SPARK-10440] [STREAMING] [DOCS] Update python API stuff in the programming guides and python docs | Tathagata Das | 2015-09-04 | 2 | -0/+29
  - Fixed information around Python API tags in streaming programming guides
  - Added missing stuff in python docs
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8595 from tdas/SPARK-10440.
* [SPARK-10417] [SQL] Iterating through Column results in infinite loop | 0x0FFF | 2015-09-02 | 2 | -0/+12
  The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. It has `__getitem__` so that, when the column holds a list or dict, you can access a particular element of it in the DataFrame API. The ability to iterate over it is just a side effect, and one that might confuse people getting familiar with Spark DataFrames (you can iterate this way over a Pandas DataFrame, for instance).
  Issue reproduction:
  ```
  df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  for i in df["name"]:
      print(i)
  ```
  Author: 0x0FFF <programmerag@gmail.com>
  Closes #8574 from 0x0FFF/SPARK-10417.
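The side effect described in SPARK-10417 comes from Python's legacy iteration protocol: an object with `__getitem__` but no `__iter__` is iterated by calling `__getitem__(0)`, `__getitem__(1)`, and so on until `IndexError` is raised. The sketch below reproduces that with a hypothetical `Indexable` class and shows one possible guard, an `__iter__` that raises `TypeError`. The commit message does not quote the actual fix, so the guard is an assumption.

```python
from itertools import islice


class Indexable:
    """Hypothetical class with __getitem__ only, loosely modelled on the Column case."""

    def __getitem__(self, key):
        # Never raises IndexError, so the legacy iteration protocol never stops.
        return "item[%r]" % (key,)


class Guarded(Indexable):
    def __iter__(self):
        # Declaring __iter__ explicitly lets the class refuse iteration up front.
        raise TypeError("object is not iterable")


# Python falls back to __getitem__(0), __getitem__(1), ... without end;
# islice caps the demonstration at three items.
first_three = list(islice(Indexable(), 3))

try:
    iter(Guarded())
    guarded_is_iterable = True
except TypeError:
    guarded_is_iterable = False
```

With the guard in place, a `for` loop over the object fails immediately with a clear error instead of spinning forever.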
* [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection | 0x0FFF | 2015-09-01 | 2 | -2/+9
  This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392).
  The problem is that for the "start of epoch" date (01 Jan 1970) the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.
  Issue reproduction on master:
  ```
  >>> from pyspark.sql.types import *
  >>> a = DateType()
  >>> a.fromInternal(0)
  0
  >>> a.fromInternal(1)
  datetime.date(1970, 1, 2)
  ```
  Author: 0x0FFF <programmerag@gmail.com>
  Closes #8556 from 0x0FFF/SPARK-10392.
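The SPARK-10392 message blames the return statement without quoting it. One pattern that produces exactly the transcript above is a truthiness check used as a `None` guard; the sketch below assumes that pattern, and the function names `from_internal_buggy` and `from_internal_fixed` are hypothetical.

```python
import datetime

EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def from_internal_buggy(days):
    # `days and ...` short-circuits on 0, so day 0 is returned unchanged.
    return days and datetime.date.fromordinal(days + EPOCH_ORDINAL)


def from_internal_fixed(days):
    # Only None means "no value"; 0 is a valid day count.
    if days is not None:
        return datetime.date.fromordinal(days + EPOCH_ORDINAL)
    return None
```

The general lesson is to test `is not None` rather than truthiness whenever 0, an empty string, or an empty collection is a legitimate value.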
* [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function | 0x0FFF | 2015-09-01 | 2 | -10/+23
  This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162).
  The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
  * Timezone information of this datetime is ignored
  * This datetime is assumed to be in the local timezone, which depends on the OS timezone setting

  The fix includes both a code change and a regression test. Problem reproduction code on master:
  ```python
  import pytz
  from datetime import datetime
  from pyspark.sql import *
  from pyspark.sql.types import *
  # `sc` is the SparkContext provided by the pyspark shell
  sqc = SQLContext(sc)
  df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
  m1 = pytz.timezone('UTC')
  m2 = pytz.timezone('Etc/GMT+3')
  df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  ```
  It gives the same timestamp, ignoring the time zone:
  ```
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  ```
  After the fix:
  ```
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  Filter (dt#0 > 946684800000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  Filter (dt#0 > 946695600000000)
   Scan PhysicalRDD[dt#0]
  ```
  PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed when I dropped the repo.
  Author: 0x0FFF <programmerag@gmail.com>
  Closes #8555 from 0x0FFF/SPARK-10162.
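The "after the fix" values for SPARK-10162 can be reproduced with the standard library alone. The sketch below is an illustration of timezone-aware conversion to microseconds since the epoch, not the code from the commit; `to_micros` is a hypothetical helper, and `datetime.timezone` is used in place of `pytz` so the example has no third-party dependency.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def to_micros(dt):
    """Convert a datetime to microseconds since the Unix epoch."""
    if dt.tzinfo is not None:
        # Aware datetime: normalise to UTC before converting.
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # Naive datetime: interpret it in the local (OS) time zone.
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond


utc = timezone.utc
minus_three = timezone(timedelta(hours=-3))  # same offset as 'Etc/GMT+3'

in_utc = to_micros(datetime(2000, 1, 1, tzinfo=utc))
in_minus_three = to_micros(datetime(2000, 1, 1, tzinfo=minus_three))
```

Here `in_utc` is 946684800000000 and `in_minus_three` is 946695600000000, three hours later, which matches the two filter constants in the fixed output.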
* [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover | Holden Karau | 2015-09-01 | 2 | -4/+89
  Add a python API for the Stop Words Remover.
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
* [SPARK-10355] [ML] [PySpark] Add Python API for SQLTransformer | Yanbo Liang | 2015-08-31 | 1 | -3/+54
  Add Python API for SQLTransformer
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8527 from yanboliang/spark-10355.
* [SPARK-8472] [ML] [PySpark] Python API for DCT | Yanbo Liang | 2015-08-31 | 1 | -1/+64
  Add Python API for ml.feature.DCT.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8485 from yanboliang/spark-8472.
* [SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model | noelsmith | 2015-08-27 | 3 | -1/+104
  * Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
  * JavaEvaluator delegates isLargerBetter() to underlying Scala object.
  * Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
  * Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).

  (This contribution is my original work and I license the work to the project under Spark's open source license.)
  Author: noelsmith <mail@noelsmith.com>
  Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
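The argmin/argmax choice listed for SPARK-10188 can be sketched as a small selection function. `best_index` and the sample metric values are hypothetical; the sketch only shows why a "larger is better" flag is needed to pick the right model.

```python
def best_index(metrics, is_larger_better):
    """Return the index of the best metric value."""
    indices = range(len(metrics))
    if is_larger_better:
        return max(indices, key=lambda i: metrics[i])  # argmax
    return min(indices, key=lambda i: metrics[i])      # argmin


rmse_per_model = [0.9, 0.4, 0.7]  # smaller is better
r2_per_model = [0.1, 0.6, 0.3]    # larger is better

best_by_rmse = best_index(rmse_per_model, is_larger_better=False)
best_by_r2 = best_index(r2_per_model, is_larger_better=True)
```

Always taking the argmax would select the model with RMSE 0.9 here, which is the worst of the three.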
* [SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader accept RDD of String for JSON | Yanbo Liang | 2015-08-26 | 1 | -6/+22
  PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
  If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8444 from yanboliang/spark-9964.
* [SPARK-10305] [SQL] fix create DataFrame from Python class | Davies Liu | 2015-08-26 | 2 | -0/+18
  cc jkbradley
  Author: Davies Liu <davies@databricks.com>
  Closes #8470 from davies/fix_create_df.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters | Sean Owen | 2015-08-25 | 2 | -2/+14
  Replace `JavaConversions` implicits with `JavaConverters`
  Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #8033 from srowen/SPARK-9613.
* [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars | zsxwing | 2015-08-24 | 1 | -21/+26
  This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.
  I ran `mvn -Pkinesis-asl -DskipTests clean install` locally, and verified the jars in my local repository were correct. I also checked Python tests for maven build, and it passed all tests.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
  e0b5818 [zsxwing] Fix the sbt build
  c697627 [zsxwing] Add the jar pathes to the exception message
  be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
* [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts | Tathagata Das | 2015-08-23 | 2 | -16/+49
  The current code only checks checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
  1. Use the same code path as Java to check whether a valid checkpoint exists
  2. Create a new Python SparkContext only if there is no active one.

  There is no test for the path, as it is hard to test distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.
  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #8366 from tdas/SPARK-10142 and squashes the following commits:
  3afa666 [Tathagata Das] Added tests
  2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
  9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
* [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function | jerryshao | 2015-08-21 | 2 | -2/+7
  Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).
  tdas, please help to review.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:
  4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
* [MINOR] [SQL] Fix sphinx warnings in PySpark SQL | MechCoder | 2015-08-20 | 2 | -5/+7
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8171 from MechCoder/sql_sphinx.
* [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs | zsxwing | 2015-08-19 | 3 | -3/+9
  This PR includes the following fixes:
  1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
  2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
  3. Fix the commands in docs so that the user can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copied to the command line, the path would be split into two parts by the extra spaces, forcing the user to fix it manually.
  Author: zsxwing <zsxwing@gmail.com>
  Closes #8315 from zsxwing/SPARK-9812.
* [SPARK-10073] [SQL] Python withColumn should replace the old column | Davies Liu | 2015-08-19 | 2 | -6/+10
  DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name).
  cc marmbrus
  Author: Davies Liu <davies@databricks.com>
  Closes #8300 from davies/with_column.
* [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` | Feynman Liang | 2015-08-19 | 1 | -2/+2
  Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
  This PR adds a `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
  CC jkbradley
  Author: Feynman Liang <fliang@databricks.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8290 from feynmanliang/SPARK-10097.
* [DOCS] [SQL] [PYSPARK] Fix typo in ntile function | Moussa Taifi | 2015-08-19 | 1 | -1/+1
  Fix typo in ntile function.
  Author: Moussa Taifi <moutai10@gmail.com>
  Closes #8261 from moutai/patch-2.
* [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct | Yanbo Liang | 2015-08-17 | 1 | -5/+62
  Add Python API, user guide and example for ml.feature.ElementwiseProduct.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8061 from yanboliang/SPARK-9768.
* [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests | Joseph K. Bradley | 2015-08-15 | 1 | -48/+129
  Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time.
  Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally).
  CC: mengxr tdas freeman-lab
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8087 from jkbradley/streaming-ml-tests.
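The waiting strategy proposed in SPARK-9805 is a poll-with-deadline loop. The sketch below is a generic version under stated assumptions: `eventually`, its `timeout` and `interval` parameters, and the sample condition are all hypothetical and are not the `_eventually` helper added by the commit.

```python
import time


def eventually(condition, timeout=30.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True      # stop early: no need to wait out the full timeout
        time.sleep(interval)
    return condition()       # one last check at the deadline


calls = []


def ready_after_three_calls():
    calls.append(1)
    return len(calls) >= 3


finished = eventually(ready_after_three_calls, timeout=5.0)
timed_out = eventually(lambda: False, timeout=0.05)
```

Because the loop returns as soon as the condition holds, the timeout can be generous (fewer flaky failures) without slowing down the runs that finish quickly.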
* [SPARK-8670] [SQL] Nested columns can't be referenced in pyspark | Wenchen Fan | 2015-08-14 | 2 | -3/+3
  This bug is caused by a wrong column-exists check in `__getitem__` of the pyspark DataFrame. `DataFrame.apply` accepts not only top-level column names but also nested column names like `a.b`, so we should remove that check from `__getitem__`.
  Author: Wenchen Fan <cloud0fan@outlook.com>
  Closes #8202 from cloud-fan/nested.
* [SPARK-9978] [PYSPARK] [SQL] fix Window.orderBy and doc of ntile() | Davies Liu | 2015-08-14 | 3 | -4/+28
  Author: Davies Liu <davies@databricks.com>
  Closes #8213 from davies/fix_window.
* [SPARK-9828] [PYSPARK] Mutable values should not be default arguments | MechCoder | 2015-08-14 | 8 | -21/+50
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8110 from MechCoder/spark-9828.
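The pitfall named in the SPARK-9828 subject line is general Python behaviour and can be shown in a few lines. The functions `append_shared` and `append_fresh` are hypothetical examples, not code from the commit.

```python
def append_shared(item, bucket=[]):
    # The default list is created once, at function definition time,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Use None as the sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


shared_first = append_shared(1)
shared_second = append_shared(2)  # same list object as shared_first
fresh_first = append_fresh(1)
fresh_second = append_fresh(2)
```

After these calls `shared_first` and `shared_second` are the same list containing both items, while the two `append_fresh` results are independent one-element lists.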
* [SPARK-8976] [PYSPARK] fix open mode in python3 | Davies Liu | 2015-08-13 | 1 | -1/+1
  This bug only happens on Python 3 and Windows.
  I tested this manually with Python 3 and the Python daemon disabled; there is no unit test yet.
  Author: Davies Liu <davies@databricks.com>
  Closes #8181 from davies/open_mode.