path: root/python
Commit message | Author | Age | Files | Lines
* [SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in PySpark's AFTSurvivalRegression (Xiangrui Meng, 2015-10-06, 1 file, -5/+1)
  If the user doesn't specify `quantileProbs` in `setParams`, it gets reset to the default value. We don't need special handling here. vectorijk yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #9001 from mengxr/SPARK-10957.
* [SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression (vectorijk, 2015-10-06, 1 file, -2/+169)
  Implement Python API for AFTSurvivalRegression. Author: vectorijk <jiangkai@gmail.com> Closes #8926 from vectorijk/spark-10688.
* [SPARK-10782] [PYTHON] Update dropDuplicates documentation (asokadiggs, 2015-09-29, 1 file, -0/+2)
  Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases. Author: asokadiggs <asoka.diggs@intel.com> Closes #8930 from asokadiggs/jira-10782.
* [SPARK-6919] [PYSPARK] Add asDict method to StatCounter (Erik Shilts, 2015-09-29, 2 files, -0/+42)
  Add method to easily convert a StatCounter instance into a Python dict. https://issues.apache.org/jira/browse/SPARK-6919 Note: This is my original work and the existing Spark license applies. Author: Erik Shilts <erik.shilts@opower.com> Closes #5516 from eshilts/statcounter-asdict.
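  A minimal usage sketch of the method described above, not taken from the patch itself; the exact keys of the returned dict (count, mean, stdev, ...) are assumptions based on StatCounter's fields.
  ```python
  from pyspark import SparkContext

  sc = SparkContext("local", "statcounter-asdict-example")
  stats = sc.parallelize([1.0, 2.0, 3.0, 4.0]).stats()  # returns a StatCounter
  summary = stats.asDict()  # plain dict; keys such as count/mean/stdev are assumed
  print(summary)
  sc.stop()
  ```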
* [SPARK-10415] [PYSPARK] [MLLIB] [DOCS] Enhance Navigation Sidebar in PySpark API (noelsmith, 2015-09-29, 4 files, -2/+197)
  These are CSS/JavaScript changes to make navigation in the PySpark API docs a bit simpler by adding the following to the sidebar:
  * Classes
  * Functions
  * Tags to highlight experimental features
  ![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png) Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html (The contribution is my original work and I license the work to the project under the project's open source license.) Author: noelsmith <mail@noelsmith.com> Closes #8571 from noel-smith/pyspark-api-nav-enhance.
* [SPARK-9681] [ML] Support R feature interactions in RFormula (Eric Liang, 2015-09-25, 1 file, -1/+1)
  This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.
* [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame (Reynold Xin, 2015-09-23, 1 file, -1/+4)
  Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take). This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion. Author: Reynold Xin <rxin@databricks.com> Closes #8876 from rxin/SPARK-10731.
* [SPARK-10446] [SQL] Support specifying the join type when calling join with usingColumns (Liang-Chi Hsieh, 2015-09-21, 1 file, -1/+5)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10446 Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8600 from viirya/usingcolumns_df.
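  A hedged sketch of what the PySpark side of this change enables; the DataFrames `df_left`/`df_right` and the column name `id` are hypothetical.
  ```python
  # Equi-join on a shared column list while choosing the join type.
  # Before this change only the inner join was supported for on=[...].
  joined = df_left.join(df_right, on=["id"], how="left_outer")
  ```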
* [SPARK-10577] [PYSPARK] DataFrame hint for broadcast join (Jian Feng, 2015-09-21, 2 files, -0/+27)
  https://issues.apache.org/jira/browse/SPARK-10577 Author: Jian Feng <jzhang.chs@gmail.com> Closes #8801 from Jianfeng-chs/master.
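  A short usage sketch of the broadcast-join hint this adds to PySpark; the DataFrames `large_df`/`small_df` and the join column are assumptions.
  ```python
  from pyspark.sql.functions import broadcast

  # Hint that small_df is small enough to be broadcast to all executors,
  # so the join can avoid a full shuffle of large_df.
  result = large_df.join(broadcast(small_df), "id")
  ```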
* [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file (Sean Owen, 2015-09-21, 1 file, -0/+0)
  Remove the ._SUCCESS.crc hidden file, which may cause problems in the distribution tar archive and is not used. Author: Sean Owen <sowen@cloudera.com> Closes #8846 from srowen/SPARK-10716.
* [SPARK-9821] [PYSPARK] reduceByKey should take a custom partitioner (Holden Karau, 2015-09-21, 1 file, -13/+16)
  From the issue: In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python. Here's an example of my code in Scala: `weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)` But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so: `weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()` But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better. Author: Holden Karau <holden@pigscanfly.ca> Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
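  A minimal sketch of the new capability, assuming the added keyword mirrors `partitionBy` (a `partitionFunc` argument) and that `sc` is an existing SparkContext; the helper names are made up for illustration.
  ```python
  def get_file_type(path):
      # Hypothetical helper standing in for getFileType from the issue.
      return "log" if path.endswith(".log") else "other"

  def file_type_partitioner(key):
      # Send "log" keys to one partition bucket, everything else to the other.
      return 0 if key == "log" else 1

  weblogs = sc.parallelize(["a.log", "b.txt", "c.log"])
  counts = (weblogs.map(lambda s: (get_file_type(s), 1))
                   .reduceByKey(lambda a, b: a + b, 2, file_type_partitioner))
  print(counts.collect())
  ```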
* [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting (noelsmith, 2015-09-21, 8 files, -1/+14)
  Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e.: ![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png) ... looks like this once the newline is added: ![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png) Author: noelsmith <mail@noelsmith.com> Closes #8851 from noel-smith/docstring-missing-newline-fix.
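  An illustration (not from the patch) of the formatting rule described above: a blank line before the `:param:`/`:return:` block keeps Sphinx from running the field list into the preceding sentence. The function itself is hypothetical.
  ```python
  def train(data, iterations=100):
      """Train a hypothetical model on an RDD of LabeledPoint.

      :param data: training data.
      :param iterations: number of iterations to run.
      :return: the fitted model.
      """
  ```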
* [SPARK-9769] [ML] [PY] Add Python API for CountVectorizerModel (Holden Karau, 2015-09-21, 1 file, -6/+142)
  From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel. Author: Holden Karau <holden@pigscanfly.ca> Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
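  A hedged usage sketch of the Python API added here; the parameter values and the DataFrame `df` (with an array-of-strings column `words`) are assumptions.
  ```python
  from pyspark.ml.feature import CountVectorizer

  cv = CountVectorizer(inputCol="words", outputCol="features",
                       vocabSize=1000, minDF=1.0)
  model = cv.fit(df)             # CountVectorizerModel
  counted = model.transform(df)  # adds a sparse term-count vector column
  print(model.vocabulary)        # learned vocabulary
  ```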
* [SPARK-10631] [DOCUMENTATION, MLLIB, PYSPARK] Added documentation for a few APIs (vinodkc, 2015-09-20, 1 file, -5/+17)
  There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts. Author: vinodkc <vinod.kc.in@gmail.com> Closes #8834 from vinodkc/fix_SPARK-10631.
* [SPARK-10710] Remove ability to disable spilling in core and SQL (Josh Rosen, 2015-09-19, 3 files, -60/+8)
  It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`. This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling. Author: Josh Rosen <joshrosen@databricks.com> Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
* [SPARK-10615] [PYSPARK] change assertEquals to assertEqual (Yanbo Liang, 2015-09-18, 4 files, -99/+99)
  As `assertEquals` is deprecated, we need to change `assertEquals` to `assertEqual` in the existing Python unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8814 from yanboliang/spark-10615.
* [SPARK-10642] [PYSPARK] Fix crash when calling rdd.lookup() on tuple keys (Liang-Chi Hsieh, 2015-09-17, 1 file, -1/+4)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10642 When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8796 from viirya/fix-pyrdd-lookup.
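  A small sketch of the scenario the fix covers, assuming an existing SparkContext `sc`.
  ```python
  # Tuple keys plus a known partitioner trigger the code path that crashed;
  # after the fix this simply returns ['a'].
  rdd = sc.parallelize([((1, 2), "a"), ((3, 4), "b")]).partitionBy(4)
  print(rdd.lookup((1, 2)))
  ```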
* [SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+28)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8692 from yu-iskw/SPARK-10282.
* [SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpm (Yu ISHIKAWA, 2015-09-17, 1 file, -1/+9)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8665 from yu-iskw/SPARK-10274.
* [SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to pyspark.mllib.util (Yu ISHIKAWA, 2015-09-17, 1 file, -2/+26)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8689 from yu-iskw/SPARK-10279.
* [SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree (Yu ISHIKAWA, 2015-09-17, 1 file, -1/+35)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8685 from yu-iskw/SPARK-10278.
* [SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+13)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8691 from yu-iskw/SPARK-10281.
* [SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+65)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8693 from yu-iskw/SPARK-10283.
* [SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+28)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8694 from yu-iskw/SPARK-10284.
* [SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.recommendation (Yu ISHIKAWA, 2015-09-16, 1 file, -1/+35)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8677 from yu-iskw/SPARK-10276.
* [SPARK-10516] [MLLIB] Added values property in DenseVector (Vinod K C, 2015-09-15, 1 file, -0/+4)
  Author: Vinod K C <vinod.kc@huawei.com> Closes #8682 from vinodkc/fix_SPARK-10516.
* [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random (noelsmith, 2015-09-15, 1 file, -1/+1)
  Missed this when reviewing `pyspark.mllib.random` for SPARK-10275. Author: noelsmith <mail@noelsmith.com> Closes #8773 from noel-smith/mllib-random-versionadded-fix.
* [SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random (Yu ISHIKAWA, 2015-09-14, 1 file, -0/+15)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8666 from yu-iskw/SPARK-10275.
* [SPARK-10273] Add @since annotation to pyspark.mllib.feature (noelsmith, 2015-09-14, 1 file, -1/+57)
  Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark). Author: noelsmith <mail@noelsmith.com> Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
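  A rough sketch of what such a decorator can look like, written here only for illustration and not the exact code moved in this commit; the decorated function is hypothetical.
  ```python
  def since(version):
      """Append a ``.. versionadded::`` note to the wrapped callable's
      docstring, tolerating callables that have no docstring at all."""
      def decorator(f):
          doc = f.__doc__ or ""
          f.__doc__ = doc.rstrip() + "\n\n.. versionadded:: %s\n" % version
          return f
      return decorator

  @since("1.5.0")
  def term_frequency(count, total):
      """Return the relative frequency of a term (hypothetical example)."""
      return float(count) / total
  ```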
* [SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly (Yanbo Liang, 2015-09-14, 2 files, -15/+107)
  The PySpark DenseVector and SparseVector `__eq__` methods should use semantic equality, and a DenseVector can be compared with a SparseVector. Implement the PySpark DenseVector and SparseVector `__hash__` methods based on the first 16 entries, so that PySpark Vector objects can be used in collections. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8166 from yanboliang/spark-9793.
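  A short sketch of the semantics described above.
  ```python
  from pyspark.mllib.linalg import DenseVector, SparseVector

  dv = DenseVector([1.0, 0.0, 3.0])
  sv = SparseVector(3, {0: 1.0, 2: 3.0})

  print(dv == sv)              # True after this change: equality by values, not representation
  print(hash(dv) == hash(sv))  # equal vectors hash alike, so they work as dict keys / in sets
  ```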
* [SPARK-10542] [PYSPARK] fix serialize namedtuple (Davies Liu, 2015-09-14, 3 files, -1/+20)
  Author: Davies Liu <davies@databricks.com> Closes #8707 from davies/fix_namedtuple.
* [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python (Yanbo Liang, 2015-09-14, 2 files, -16/+33)
  [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a `convergenceTol` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8457 from yanboliang/spark-10194.
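  A hedged sketch of the new parameter on one of the SGD-based trainers; `training_rdd` (an RDD of LabeledPoint) and the numeric values are assumptions.
  ```python
  from pyspark.mllib.classification import LogisticRegressionWithSGD

  # Stop early once the change between iterations drops below the tolerance.
  model = LogisticRegressionWithSGD.train(training_rdd,
                                          iterations=200,
                                          convergenceTol=1e-4)
  ```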
* [SPARK-6548] Adding stddev to DataFrame functions (JihongMa, 2015-09-12, 1 file, -18/+18)
  Adding STDDEV support for DataFrame using a 1-pass online/parallel algorithm to compute variance. Please review the code change. Author: JihongMa <linlin200605@gmail.com> Author: Jihong MA <linlin200605@gmail.com> Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com> Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local> Closes #6297 from JihongMA/SPARK-SQL.
* [SPARK-9014] [SQL] Allow Python spark API to use built-in exponential operator (0x0FFF, 2015-09-11, 2 files, -1/+14)
  This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014). Added functionality: the `Column` object in Python now supports the exponential operator `**`. Example:
  ```
  from pyspark.sql import *
  df = sqlContext.createDataFrame([Row(a=2)])
  df.select(3**df.a, df.a**3, df.a**df.a).collect()
  ```
  Outputs:
  ```
  [Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8658 from 0x0FFF/SPARK-9014.
* [PYTHON] Fixed typo in exception message (Icaro Medeiros, 2015-09-11, 1 file, -1/+1)
  Just fixing a typo in the exception message raised when attempting to pickle SparkContext. Author: Icaro Medeiros <icaro.medeiros@gmail.com> Closes #8724 from icaromedeiros/master.
* [SPARK-8530] [ML] Add python API for MinMaxScaler (Yuhao Yang, 2015-09-11, 1 file, -5/+99)
  JIRA: https://issues.apache.org/jira/browse/SPARK-8530 Add a Python API for MinMaxScaler. JIRA for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514 Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7150 from hhbyyh/pythonMinMax.
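  A minimal sketch of the Python API added here; the DataFrame `df` with a vector column `features` is assumed.
  ```python
  from pyspark.ml.feature import MinMaxScaler

  scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures",
                        min=0.0, max=1.0)
  model = scaler.fit(df)        # learns per-feature min/max statistics
  scaled = model.transform(df)  # rescales each feature into [0, 1]
  ```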
* [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils (Joseph K. Bradley, 2015-09-11, 1 file, -8/+8)
  Changes:
  * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
  * MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore.
  CC: holdenk mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8679 from jkbradley/doc-fixes-1.5.
* [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier (Yanbo Liang, 2015-09-11, 1 file, -1/+131)
  Add Python API for `MultilayerPerceptronClassifier`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8067 from yanboliang/SPARK-9773.
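  A hedged usage sketch; the layer sizes and `train_df` (a DataFrame with `features`/`label` columns) are assumptions.
  ```python
  from pyspark.ml.classification import MultilayerPerceptronClassifier

  # 4 input features, one hidden layer of 5 units, 3 output classes.
  mlp = MultilayerPerceptronClassifier(layers=[4, 5, 3], maxIter=100,
                                       blockSize=128, seed=42)
  model = mlp.fit(train_df)
  predictions = model.transform(train_df)
  ```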
* [SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark (Yanbo Liang, 2015-09-11, 4 files, -96/+143)
  LinearRegression and LogisticRegression lack some Params in Python, and some Params are not in shared classes, which means we need to write them for each class. These kinds of Params are listed here:
  ```scala
  HasElasticNetParam
  HasFitIntercept
  HasStandardization
  HasThresholds
  ```
  Here we implement them in the shared params on the Python side and make the LinearRegression/LogisticRegression parameters on par with the Scala ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8508 from yanboliang/spark-10026.
* [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature (Yanbo Liang, 2015-09-10, 3 files, -8/+59)
  The missing methods of ml.feature are listed here: `StringIndexer` lacks the parameter `handleInvalid`. `StringIndexerModel` lacks the method `labels`. `VectorIndexerModel` lacks the methods `numFeatures` and `categoryMaps`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8313 from yanboliang/spark-10027.
* [SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getitem__ (Yanbo Liang, 2015-09-10, 1 file, -0/+15)
  pyspark.sql.types.Row implements `__getitem__`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8333 from yanboliang/spark-7544.
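  A quick sketch of what the change enables.
  ```python
  from pyspark.sql import Row

  row = Row(name="Alice", age=5)
  print(row["name"], row["age"])  # field access by name via __getitem__
  print(row[0])                   # positional access still works (Row is a tuple)
  ```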
* [SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer (Yanbo Liang, 2015-09-09, 1 file, -5/+90)
  Add Python API for ml.feature.VectorSlicer. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8102 from yanboliang/SPARK-9772.
* [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark (Holden Karau, 2015-09-08, 2 files, -5/+72)
  Adds IndexToString to PySpark. Author: Holden Karau <holden@pigscanfly.ca> Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
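  A small sketch of the transformer added to PySpark; the label list and `indexed_df` are assumptions.
  ```python
  from pyspark.ml.feature import IndexToString

  converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory",
                            labels=["a", "b", "c"])
  decoded = converter.transform(indexed_df)  # maps 0.0/1.0/2.0 back to "a"/"b"/"c"
  ```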
* [SPARK-10094] Pyspark ML Feature transformers marked as experimental (noelsmith, 2015-09-08, 1 file, -0/+52)
  Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental. Author: noelsmith <mail@noelsmith.com> Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
* [SPARK-10373] [PYSPARK] move @since into pyspark from sql (Davies Liu, 2015-09-08, 9 files, -25/+23)
  cc mengxr Author: Davies Liu <davies@databricks.com> Closes #8657 from davies/move_since.
* [SPARK-10440] [STREAMING] [DOCS] Update python API stuff in the programming guides and python docs (Tathagata Das, 2015-09-04, 2 files, -0/+29)
  - Fixed information around Python API tags in streaming programming guides
  - Added missing stuff in python docs
  Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8595 from tdas/SPARK-10440.
* [SPARK-10417] [SQL] Iterating through Column results in infinite loop (0x0FFF, 2015-09-02, 2 files, -0/+12)
  The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable for Python. In fact it has `__getitem__` to address the case when the column might be a list or dict, so that you can access a certain element of it in the DF API. The ability to iterate over it is just a side effect that might cause confusion for people getting familiar with Spark DF (as you might iterate this way on a Pandas DF, for instance). Issue reproduction:
  ```
  df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  for i in df["name"]: print i
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8574 from 0x0FFF/SPARK-10417.
* [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection (0x0FFF, 2015-09-01, 2 files, -2/+9)
  This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392). The problem is that for the "start of epoch" date (01 Jan 1970) the PySpark class DateType returns 0 instead of the `datetime.date`, due to the implementation of its return statement. Issue reproduction on master:
  ```
  >>> from pyspark.sql.types import *
  >>> a = DateType()
  >>> a.fromInternal(0)
  0
  >>> a.fromInternal(1)
  datetime.date(1970, 1, 2)
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8556 from 0x0FFF/SPARK-10392.
* [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function (0x0FFF, 2015-09-01, 2 files, -10/+23)
  This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162). The issue is with the DataFrame filter() function, if datetime.datetime is passed to it:
  * Timezone information of this datetime is ignored
  * This datetime is assumed to be in the local timezone, which depends on the OS timezone setting
  The fix includes both the code change and a regression test. Problem reproduction code on master:
  ```python
  import pytz
  from datetime import datetime
  from pyspark.sql import *
  from pyspark.sql.types import *
  sqc = SQLContext(sc)
  df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
  m1 = pytz.timezone('UTC')
  m2 = pytz.timezone('Etc/GMT+3')
  df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  ```
  It gives the same timestamp, ignoring the time zone:
  ```
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  ```
  After the fix:
  ```
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  Filter (dt#0 > 946684800000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  Filter (dt#0 > 946695600000000)
   Scan PhysicalRDD[dt#0]
  ```
  PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when dropping the repo. Author: 0x0FFF <programmerag@gmail.com> Closes #8555 from 0x0FFF/SPARK-10162.
* [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover (Holden Karau, 2015-09-01, 2 files, -4/+89)
  Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
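  A minimal sketch of the new API; the DataFrame `df` with an array-of-strings column `raw` is assumed.
  ```python
  from pyspark.ml.feature import StopWordsRemover

  remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
  cleaned = remover.transform(df)  # drops default English stop words from each row
  ```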